Readings and Learning Objectives


STUDY SESSION 4 


12. Fundamentals of Probability
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 1. 


After completing this reading, you should be able to: 
a. describe an event and an event space.
b. describe independent events and mutually exclusive events.
c. explain the difference between independent events and conditionally independent events.
d. calculate the probability of an event for a discrete probability function.
e. define and calculate a conditional probability.
f. distinguish between conditional and unconditional probabilities.
g. explain and apply Bayes’ rule.

13. Random Variables
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 2. 


After completing this reading, you should be able to: 

a. describe and distinguish a probability mass function from a cumulative distribution function, and 
explain the relationship between these two. 

b. understand and apply the concept of a mathematical expectation of a random variable. 

c. describe the four common population moments. 

d. explain the differences between a probability mass function and a probability density function. 

e. characterize the quantile function and quantile-based estimators. 

f. explain the effect of a linear transformation of a random variable on the mean, variance, standard 
deviation, skewness, kurtosis, median, and interquartile range. 


14. Common Univariate Random Variables
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 3. 


After completing this reading, you should be able to: 

a. distinguish the key properties and identify the common occurrences of the following 
distributions: uniform distribution, Bernoulli distribution, binomial distribution, Poisson 
distribution, normal distribution, lognormal distribution, Chi-squared distribution, Student’s t 
and F-distributions. 

b. describe a mixture distribution and explain the creation and characteristics of mixture 
distributions. 


15. Multivariate Random Variables

Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 

Chapter 4. 

After completing this reading, you should be able to: 

a. explain how a probability matrix can be used to express a probability mass function.

b. compute the marginal and conditional distributions of a discrete bivariate random variable.

c. explain how the expectation of a function is computed for a bivariate discrete random variable.

d. define covariance and explain what it measures.

e. explain the relationship between the covariance and correlation of two random variables and
how these are related to the independence of the two variables.

f. explain the effects of applying linear transformations on the covariance and correlation between
two random variables.

g. compute the variance of a weighted sum of two random variables.

h. compute the conditional expectation of a component of a bivariate random variable.

i. describe the features of an independent and identically distributed (iid) sequence of random 
variables. 

j. explain how the iid property is helpful in computing the mean and variance of a sum of iid random 
variables. 


STUDY SESSION 5 


16. Sample Moments 
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 5. 


After completing this reading, you should be able to: 

a. estimate the mean, variance, and standard deviation using sample data.

b. explain the difference between a population moment and a sample moment.

c. distinguish between an estimator and an estimate.

d. describe the bias of an estimator and explain what the bias measures.

e. explain what is meant by the statement that the mean estimator is BLUE.

f. describe the consistency of an estimator and explain the usefulness of this concept.

g. explain how the Law of Large Numbers (LLN) and Central Limit Theorem (CLT) apply to the
sample mean.

h. estimate and interpret the skewness and kurtosis of a random variable.

i. use sample data to estimate quantiles, including the median.

j. estimate the mean of two variables and apply the CLT.

k. estimate the covariance and correlation between two random variables.

l. explain how coskewness and cokurtosis are related to skewness and kurtosis.

17. Hypothesis Testing
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 6. 


After completing this reading, you should be able to: 

a. construct an appropriate null hypothesis and alternative hypothesis and distinguish between the 
two. 

b. differentiate between a one-sided and a two-sided test and identify when to use each test. 

c. explain the difference between Type I and Type II errors and how these relate to the size and 
power of a test. 

d. understand how a hypothesis test and a confidence interval are related. 

e. explain what the p-value of a hypothesis test measures. 

f. construct and apply confidence intervals for one-sided and two-sided hypothesis tests and 
interpret the results of hypothesis tests with a specific confidence level. 

g. identify the steps to test a hypothesis about the difference between two population means. 

h. explain the problem of multiple testing and how it can lead to biased results. 


STUDY SESSION 6 


18. Linear Regression 
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 7. 


After completing this reading, you should be able to: 

a. describe the models that can be estimated using linear regression and differentiate them from 
those which cannot. 

b. interpret the results of an ordinary least squares (OLS) regression with a single explanatory
variable.

c. describe the key assumptions of OLS parameter estimation.

d. characterize the properties of OLS estimators and their sampling distributions. 

e. construct, apply, and interpret hypothesis tests and confidence intervals for a single regression 
coefficient in a regression. 

f. explain the steps needed to perform a hypothesis test in a linear regression. 


g. describe the relationship among a t-statistic, its p-value, and a confidence interval.

h. estimate the correlation coefficient from the R² measure obtained in linear regressions with a
single explanatory variable.


19. Regression with Multiple Explanatory Variables 
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 8. 


After completing this reading, you should be able to: 


a. distinguish between the relative assumptions of single and multiple regression.

b. interpret regression coefficients in a multiple regression.

c. interpret goodness-of-fit measures for single and multiple regressions, including R² and adjusted
R².

d. construct, apply, and interpret joint hypothesis tests and confidence intervals for multiple
coefficients in a regression.

e. calculate the regression R² using the three components of the decomposed variation of the
dependent variable data: the explained sum of squares, the total sum of squares, and the residual
sum of squares.


20. Regression Diagnostics 
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 9. 


After completing this reading, you should be able to: 


a. explain how to test whether a regression is affected by heteroskedasticity.

b. describe approaches to using heteroskedastic data.

c. characterize multicollinearity and its consequences, as well as distinguish between
multicollinearity and perfect collinearity.

d. describe the consequences of excluding a relevant explanatory variable from a model and
contrast those with the consequences of including an irrelevant regressor.

e. explain two model selection procedures and how these relate to the bias-variance trade-off.

f. describe the various methods of visualizing residuals and their relative strengths.

g. describe methods for identifying outliers and their impact.

h. determine the conditions under which OLS is the best linear unbiased estimator.


STUDY SESSION 7 


21. Stationary Time Series 
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 10. 


After completing this reading, you should be able to: 

a. describe the requirements for a series to be covariance stationary.

b. define the autocovariance function and the autocorrelation function.

c. define white noise, and describe independent white noise and normal (Gaussian) white noise.

d. define and describe the properties of autoregressive (AR) processes.

e. define and describe the properties of moving average (MA) processes.

f. explain how a lag operator works.

g. explain mean reversion and calculate a mean-reverting level.

h. define and describe the properties of autoregressive moving average (ARMA) processes.

i. describe the application of AR, MA, and ARMA processes.

j. describe sample autocorrelation and partial autocorrelation.

k. describe the Box-Pierce Q statistic and the Ljung-Box Q statistic.

l. explain how forecasts are generated from ARMA models.

m. describe the role of mean reversion in long-horizon forecasts.

n. explain how seasonality is modeled in a covariance-stationary ARMA.


22. Non-Stationary Time Series
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023.
Chapter 11.


After completing this reading, you should be able to: 

a. describe linear and nonlinear time trends. 

b. explain how to use regression analysis to model seasonality. 

c. describe a random walk and a unit root. 

d. explain the challenges of modeling time series containing unit roots. 

e. describe how to test if a time series contains a unit root. 

f. explain how to construct an h-step-ahead point forecast for a time series with seasonality. 
g. calculate the estimated trend value and form an interval forecast for a time series. 


23. Measuring Returns, Volatility, and Correlation
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 12. 


After completing this reading, you should be able to: 

a. calculate, distinguish, and convert between simple and continuously compounded returns. 

b. define and distinguish among volatility, variance rate, and implied volatility. 

c. describe how the first two moments may be insufficient to describe non-normal distributions. 

d. explain how the Jarque-Bera test is used to determine whether returns are normally distributed. 

e. describe the power law and its use for non-normal distributions. 

f. define correlation and covariance and differentiate between correlation and dependence. 

g. describe properties of correlations between normally distributed variables when using a one- 
factor model. 

h. compare and contrast the different measures of correlation used to assess dependence. 


24. Simulation and Bootstrapping
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 13. 


After completing this reading, you should be able to: 

a. describe the basic steps to conduct a Monte Carlo simulation.

b. describe ways to reduce Monte Carlo sampling error.

c. explain the use of antithetic and control variates in reducing Monte Carlo sampling error.

d. describe the bootstrapping method and its advantage over Monte Carlo simulation.

e. describe pseudo-random number generation.

f. describe situations where the bootstrapping method is ineffective.

g. describe the disadvantages of the simulation approach to financial problem solving.

25. Machine Learning Methods
Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 14. 


After completing this reading, you should be able to: 

a. discuss the philosophical and practical differences between machine learning techniques and 
classical econometrics. 

b. compare and apply the two methods utilized for rescaling variables in data preparation. 

c. explain the differences among the training, validation, and test data sub-samples, and how each is 
used. 

d. understand the differences between and consequences of underfitting and overfitting, and
propose potential remedies for each.

e. use principal components analysis to reduce the dimensionality of a set of features.

f. describe how the K-means algorithm separates a sample into clusters.

g. describe natural language processing and how it is used.

h. differentiate among unsupervised, supervised, and reinforcement learning models.

i. explain how reinforcement learning operates and how it is used in decision-making.

26. Machine Learning and Prediction

Global Association of Risk Professionals. Quantitative Analysis. New York, NY: Pearson, 2023. 
Chapter 15. 

After completing this reading, you should be able to: 

a. explain the role of linear regression and logistic regression in prediction. 

b. evaluate the predictive performance of logistic regression models. 

c. understand how to encode categorical variables. 


d. discuss why regularization is useful, and distinguish between the ridge regression and LASSO
approaches.

e. show how a decision tree is constructed and interpreted.

f. describe how ensembles of learners are built.

g. explain the intuition and processes behind the K nearest neighbors and support vector machine
methods for classification.

h. understand how neural networks are constructed and how their weights are determined.

i. compare the logistic regression and neural network classification approaches using a confusion
matrix.

The following is a review of the Quantitative Analysis principles designed to address the learning objectives set
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 1. 


READING 12 
FUNDAMENTALS OF PROBABILITY 


Study Session 4 


EXAM FOCUS 


This reading covers important terms and concepts associated with probability theory. 
Specifically, we will examine the difference between independent and mutually 
exclusive events, discrete probability functions, and the difference between 
unconditional and conditional probabilities. Bayes’ rule is also examined as a way to 
update a given set of prior probabilities. For the exam, be able to calculate conditional 
probabilities, joint probabilities, and probabilities based on a probability function. Also, 
understand when and how to apply Bayes’ formula. 


MODULE 12.1: BASICS OF PROBABILITY 


When an outcome is unknown, such as the outcome (realization) of the flip of a coin or 
the high temperature tomorrow in Dubai, we refer to it as a random variable. We can 
describe a random variable with the probabilities of its possible outcomes. For the flip 
of a fair coin, we refer to the probability of heads as P(heads), which is 50%. We can 
think of a probability as the likelihood that an outcome will occur. If we flip a fair coin 
100 times, we expect that on average it will be heads 50 times. 


A probability equal to 0 for an outcome means that the outcome will not happen. A 
probability equal to 1 for an outcome means it will happen with certainty. Probabilities 
cannot be less than 0 or greater than 1. 


The probability that a random variable will have a specific outcome, given that some 
other outcome has occurred, is referred to as a conditional probability. The 
probability that A will occur, given that B has occurred, is written as P(A|B). For 
example, the probability that a day’s high temperature in Seattle will be between 70 
and 80 degrees is an unconditional probability (i.e., marginal probability). The 
probability that the high temperature will be between 70 and 80 degrees, given that the 
sky is cloudy that day, is a conditional probability. 


The probability that both A and B will occur is written P(AB) and referred to as the 
joint probability of A and B (both occurring). 


Events and Event Spaces 


LO 12.a: Describe an event and an event space. 


An event is a single outcome or a combination of outcomes for a random variable. 
Consider a random variable that is the result of rolling a fair six-sided die. The 
outcomes with positive probability (those that may happen) are the integers 1, 2, 3, 4, 5, 
and 6. For the event x = 3, we can write P(3) = 1/6 = 16.7%. Other possible events 
include getting a 3 or 4, P(3 or 4) = 2/6 = 33.3%, and getting an even number, P(x is 
even) = P(x = 2, 4, or 6) = 3/6 = 50%. The probability that the realization of this random 
variable is equal to one of the possible outcomes (x = 1, 2, 3, 4, 5, or 6) is 100%. 
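
The die probabilities above can be checked with a short sketch. It assumes each face has probability 1/6 and treats an event as a set of outcomes; the helper name `event_probability` is illustrative, not part of the reading.

```python
from fractions import Fraction

# Outcomes of one roll of a fair six-sided die, each with probability 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
prob = {x: Fraction(1, 6) for x in outcomes}

def event_probability(event):
    """Probability of an event, given as a set of outcomes (hypothetical helper)."""
    return sum(prob[x] for x in event)

p_three = event_probability({3})               # P(3) = 1/6
p_three_or_four = event_probability({3, 4})    # P(3 or 4) = 1/3
p_even = event_probability({2, 4, 6})          # P(x is even) = 1/2
p_any = event_probability(set(outcomes))       # P(some outcome occurs) = 1

print(p_three, p_three_or_four, p_even, p_any)
```

Exact fractions are used so that 1/6 + 1/6 gives exactly 1/3 rather than a rounded decimal.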


The event space for a random variable is the set of all possible outcomes and 
combinations of outcomes. Consider a flip of a fair coin. The event space is heads, tails, 
heads and tails, and neither heads nor tails. P(heads) and P(tails) are both 50%. The 
probability of both heads and tails is zero, as is the probability of neither heads nor 
tails. 


PROFESSOR'S NOTE
The notation P(A ∪ B) is sometimes used to mean the probability of A or B,
and the notation P(A ∩ B) is sometimes used to mean the probability of A and
B.


Independent and Mutually Exclusive Events 


LO 12.b: Describe independent events and mutually exclusive events. 


Two events are independent events if knowing the outcome of one does not affect the 
probability of the other. When two events are independent, the following two 
probability relationships must hold: 


1. P(A) x P(B) = P(AB). The probability that both A and B will happen is the product of 
their unconditional probabilities. 


2. P(A|B) = P(A). The conditional probability of A given that B occurs is simply the 
unconditional probability of A occurring. This means B occurring does not change 
the probability of A. 


Consider flipping a coin twice. Getting heads on the first flip does not change the
probability of getting heads on the second flip. The two events are independent. In this
case, the joint probability of getting heads on both flips is simply the product of their
unconditional probabilities. Given that the probability of getting heads is 50%, the
probability of getting heads on two flips in a row is 0.5 x 0.5 = 25%.


If A₁, A₂, ..., Aₙ are independent events, their joint probability P(A₁ and A₂ ... and Aₙ) is
equal to P(A₁) x P(A₂) x ... x P(Aₙ).
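
As a minimal sketch of the two-flip example, the code below enumerates the four equally likely sequences and checks both independence conditions. The function and variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space for two flips of a fair coin; each sequence has probability 1/4.
flips = list(product("HT", repeat=2))
p_each = Fraction(1, len(flips))

def prob(event):
    """Probability of the set of sequences for which event(seq) is true."""
    return sum(p_each for seq in flips if event(seq))

def first_heads(seq):
    return seq[0] == "H"

def second_heads(seq):
    return seq[1] == "H"

def both_heads(seq):
    return first_heads(seq) and second_heads(seq)

p_a = prob(first_heads)     # 1/2
p_b = prob(second_heads)    # 1/2
p_ab = prob(both_heads)     # 1/4

# Condition 1: P(A) x P(B) = P(AB).  Condition 2: P(A|B) = P(AB) / P(B) = P(A).
independent_product = (p_a * p_b == p_ab)
independent_conditional = (p_ab / p_b == p_a)
print(independent_product, independent_conditional)
```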


Two events are mutually exclusive events if they cannot both happen. Consider the
possible outcomes of one roll of a die. The events “x = an even number” and “x = 3” are
mutually exclusive; they cannot both happen on the same roll.


In general, P(A or B) = P(A) + P(B) - P(AB). We must subtract the probability of both A
and B happening to avoid counting those outcomes twice. If the probability that one
stock will rise tomorrow, P(A), is 60% and the probability that another stock will rise
tomorrow, P(B), is 55%, we cannot calculate the probability that at least one of them
will rise tomorrow as 60% + 55% = 115%. We must subtract the joint probability that both
stocks will rise to get P(A or B).


When events A and B are mutually exclusive, P(AB) is zero, so P(A or B) is simply P(A) 
+ P(B). 
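
The addition rule can be illustrated with a single die roll. This is a sketch only: the event "x is greater than 4" is an assumed example that does not appear in the text, chosen because it overlaps with "x is even" at the outcome 6.

```python
from fractions import Fraction

# Outcomes of one roll of a fair six-sided die (equally likely).
die = frozenset(range(1, 7))

def prob(event):
    """P(event) for a set of equally likely die outcomes."""
    return Fraction(len(event & die), len(die))

even = {2, 4, 6}
above_four = {5, 6}     # illustrative event, not from the text
three = {3}

# General addition rule: subtract the overlap so that 6 is not counted twice.
p_even_or_above_four = prob(even) + prob(above_four) - prob(even & above_four)

# Mutually exclusive events: the overlap is empty, so the rule reduces to a sum.
p_even_and_three = prob(even & three)
p_even_or_three = prob(even) + prob(three)

print(p_even_or_above_four, p_even_and_three, p_even_or_three)
```

In both cases the result matches the probability of the union computed directly from the set of outcomes.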


Conditionally Independent Events 


LO 12.c: Explain the difference between independent events and conditionally 
independent events. 


Two events, A and B, may be independent or dependent conditional on a third event, C,
regardless of whether A and B are unconditionally independent. When A and B are
conditionally independent given C, P(A|C) x P(B|C) = P(AB|C).


Consider Event A, “scores above average on an exam,” and Event B, “is taller than
average.” For a population of grade school students, these events may not be
independent, as taller students are older on average and likely in a higher grade. Taller
students may well do better on a given exam than shorter (younger) students. If we add
the conditioning Event C, “age equals 8,” we may find that height and exam scores are
independent; that is, A and B are independent conditional on C even though they are not
unconditionally independent.
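
The height-and-score example can be made concrete with invented numbers. Every probability in the sketch below is a hypothetical assumption (two age groups, with score and height independent within each group); none of it comes from the reading. It shows A and B failing the unconditional product test while passing it once we condition on age 8.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical population: half the students are 8, half are 10.
p_age = {8: F(1, 2), 10: F(1, 2)}
p_high_score = {8: F(3, 10), 10: F(7, 10)}    # assumed P(A | age)
p_tall = {8: F(2, 10), 10: F(8, 10)}          # assumed P(B | age)

# Joint distribution over (age, A, B), built so A and B are independent within each age.
joint = {}
for age, a, b in product(p_age, (True, False), (True, False)):
    pa = p_high_score[age] if a else 1 - p_high_score[age]
    pb = p_tall[age] if b else 1 - p_tall[age]
    joint[(age, a, b)] = p_age[age] * pa * pb

def prob(pred):
    """Sum the joint probabilities of all (age, A, B) cells satisfying pred."""
    return sum(p for key, p in joint.items() if pred(*key))

# Unconditional probabilities.
p_a = prob(lambda age, a, b: a)
p_b = prob(lambda age, a, b: b)
p_ab = prob(lambda age, a, b: a and b)

# Probabilities conditional on C: age equals 8.
p_c = prob(lambda age, a, b: age == 8)
p_a_c = prob(lambda age, a, b: a and age == 8) / p_c
p_b_c = prob(lambda age, a, b: b and age == 8) / p_c
p_ab_c = prob(lambda age, a, b: a and b and age == 8) / p_c

print(p_ab, p_a * p_b)          # 31/100 versus 1/4: not independent
print(p_ab_c, p_a_c * p_b_c)    # 3/50 versus 3/50: conditionally independent
```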


MODULE QUIZ 12.1
1. For the roll of a fair six-sided die, how many of the following are classified as
events?
• The outcome is 3.
• The outcome is an even number.
• The outcome is not 2, 3, 4, 5, or 6.
A. One. 
B. Two. 
C. Three. 
D. None. 


2. Which of the following equalities does not imply that the events A and B are 
independent? 
A. P(AB) = P(A) x P(B). 
B. P(A or B) = P(A) + P(B) - P(AB). 
C. P(A|B) = P(A). 
D. P(AB) / P(B) = P(A). 


3. Two independent events:
A. must be conditionally independent.
B. cannot be conditionally independent.
C. may be conditionally independent or not conditionally independent.
D. are conditionally independent only if they are mutually exclusive events.


MODULE 12.2: CONDITIONAL, UNCONDITIONAL, 
AND JOINT PROBABILITIES 


Discrete Probability Function 


LO 12.d: Calculate the probability of an event for a discrete probability function. 


A discrete probability function is one for which there are a finite number of possible 
outcomes. The probability function gives us the probability of each possible outcome. 
Consider a random variable for which the possible outcomes are x = 1, 2, 3, or 4, with a
probability function of x/10 so that P(x) = x/10. The probability of an outcome of 3 is 
3/10 = 30%. The probability of an outcome of either 2 or 4 is 2/10 + 4/10 = 60%. This 
function qualifies as a probability function because the probability of getting one of the 
possible outcomes is 1/10 + 2/10 + 3/10 + 4/10 = 10/10 = 100%. 
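
A brief sketch of the same probability function, P(x) = x/10, stored as a dictionary. The variable names are illustrative.

```python
from fractions import Fraction

# Probability function P(x) = x/10 over the outcomes 1, 2, 3, and 4.
pmf = {x: Fraction(x, 10) for x in (1, 2, 3, 4)}

# A valid probability function has nonnegative values that sum to 1.
is_valid = all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

p_3 = pmf[3]                  # 3/10
p_2_or_4 = pmf[2] + pmf[4]    # 3/5, i.e., 60%

print(is_valid, p_3, p_2_or_4)
```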


Conditional and Unconditional Probabilities 


LO 12.e: Define and calculate a conditional probability. 


LO 12.f: Distinguish between conditional and unconditional probabilities. 


Sometimes we are interested in the probability of an event, given that some other event 
has occurred. As mentioned earlier, we refer to this as a conditional probability, 
P(A|B). 


Consider conditional probabilities that an employee at Acme, Inc., earns more than 
$40,000 per year, P(40+), conditioned on the highest level of education an employee has 
attained. Employees fall into one of three education levels: no degree (ND), bachelor’s 
degree (BD), and higher-than-bachelor’s degree (HBD). If 60% of the employees have no 
degree, 30% of the employees have attained only a bachelor’s degree, and 10% have 
attained a higher degree, we write P(ND) = 60%, P(BD) = 30%, and P(HBD) = 10%. 


Note that the three levels of education attainment are mutually exclusive; an employee 
can only be in one of the three categories of educational attainment. Note also that the 
three categories are also exhaustive; the categories cover all the possible levels of 
educational attainment. We can write this as P(ND or BD or HBD) = 100%. 


Given a conditional probability and the unconditional probability of the conditioning 
event, we can calculate the joint probability of both events using P(AB) = P(A|B) x 
P(B). Assume that for Acme, 10% of the employees with no degree, 70% of the 
employees with only a bachelor’s degree, and 100% of employees with a degree beyond 
a bachelor’s degree earn more than $40,000 per year. That is, P(40+|ND) = 10%, 
P(40+|BD) = 70%, and P(40+|HBD) = 100%. 


Using these conditional probabilities, along with the unconditional probabilities P(ND) 
= 60%, P(BD) = 30%, and P(HBD) = 10%, we can calculate the joint probabilities: 


P(40+ and ND) = 10% x 60% = 6% 

P(40+ and BD) = 70% x 30% = 21% 

P(40+ and HBD) = 100% x 10% = 10%. 

We can use these probabilities to illustrate the total probability rule, which states
that if the conditioning events Bᵢ are mutually exclusive and exhaustive, then:


$$P(A) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \cdots + P(A \mid B_n)P(B_n)$$


This is the sum of the joint probabilities. For Acme, P(40+) = 6% + 21% + 10% = 37%;
that is, 37% of the employees earn more than $40,000 per year.

Rearranging P(AB) = P(A|B) x P(B), we get:

$$P(A \mid B) = \frac{P(AB)}{P(B)}$$

That is, we can calculate a conditional probability from the joint probability of two
events and the unconditional probability of the conditioning event. As an example, the
conditional probability P(40+|BD) is:

$$P(40+ \mid BD) = \frac{P(40+ \text{ and } BD)}{P(BD)} = \frac{21\%}{30\%} = 70\%$$
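
The Acme calculations can be reproduced in a few lines. This sketch uses the probabilities given in the example; the dictionary names are illustrative.

```python
from fractions import Fraction as F

# Unconditional probabilities of each education level (mutually exclusive, exhaustive).
p_edu = {"ND": F(60, 100), "BD": F(30, 100), "HBD": F(10, 100)}

# Conditional probabilities of earning more than $40,000, given education level.
p_40_given = {"ND": F(10, 100), "BD": F(70, 100), "HBD": F(1)}

# Joint probabilities: P(40+ and level) = P(40+ | level) x P(level).
joint = {level: p_40_given[level] * p_edu[level] for level in p_edu}

# Total probability rule: sum the joint probabilities over all levels.
p_40 = sum(joint.values())

# Recover a conditional probability from the joint and the unconditional.
p_40_given_bd = joint["BD"] / p_edu["BD"]

print(float(p_40), float(p_40_given_bd))
```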


Bayes’ Rule 


LO 12.g: Explain and apply Bayes’ rule. 


Bayes’ rule allows us to use information about the outcome of one event to improve 
our estimates of the unconditional probability of another event. 


From our rules of probability, we know that P(A|B) x P(B) = P(AB) and that P(B|A) x
P(A) = P(AB), so we can write P(A|B) x P(B) = P(B|A) x P(A). Rearranging these terms,
we can arrive at Bayes’ rule:


$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$


Given the unconditional probabilities of A and B and the conditional probability of B 
given A, we can calculate the conditional probability of A given B. The following 
example illustrates the use of Bayes’ rule and provides some intuition about what this 
formula is telling us. 


EXAMPLE: Bayes’ formula 


There is a 60% probability the economy will outperform, and if it does, there is a
70% probability a stock will go up and a 30% probability the stock will go down.
There is a 40% probability the economy will underperform, and if it does, there is a
20% probability the stock in question will increase in value (have gains) and an 80%
probability it will not. Given that the stock increased in value, calculate the
probability that the economy outperformed.


Answer: 


P(outperform and gains) = 60% x 70% = 42%
P(outperform and no gains) = 60% x 30% = 18%
P(underperform and gains) = 40% x 20% = 8%
P(underperform and no gains) = 40% x 80% = 32%


Here, we have multiplied the probabilities along each branch to calculate the
probabilities of each of the four outcome pairs. Note that these sum to 1. Given that
the stock has gains, what is our updated probability of an outperforming economy?
We sum the probability of stock gains in both states (outperform and underperform)
to get 42% + 8% = 50%. Given that the stock has gains, the probability that the
economy has outperformed is:

$$P(\text{outperform} \mid \text{gains}) = \frac{42\%}{50\%} = 84\%$$


The numerator for the calculation of the updated probability P(A|B) using Bayes’
formula in the example is the joint probability of outperform and gains. This is
calculated as P(gains|outperform) x P(outperform) (i.e., 0.7 x 0.6 = 0.42). The
denominator is the unconditional probability of gains, P(gains and outperform) +
P(gains and underperform) (i.e., 0.42 + 0.08 = 0.50).
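
The economy example can be written as a small, general Bayes update. This is a sketch under the assumption that the states are mutually exclusive and exhaustive; `bayes_posterior` is a hypothetical helper name, not a library function.

```python
from fractions import Fraction as F

def bayes_posterior(prior, likelihood):
    """Return P(state | evidence) for each state, plus P(evidence).

    prior[state] is P(state); likelihood[state] is P(evidence | state).
    The states must be mutually exclusive and exhaustive.
    """
    joint = {s: likelihood[s] * prior[s] for s in prior}    # P(evidence and state)
    p_evidence = sum(joint.values())                        # total probability rule
    posterior = {s: joint[s] / p_evidence for s in prior}
    return posterior, p_evidence

prior = {"outperform": F(60, 100), "underperform": F(40, 100)}
p_gains_given = {"outperform": F(70, 100), "underperform": F(20, 100)}

posterior, p_gains = bayes_posterior(prior, p_gains_given)
print(float(p_gains))                    # 0.5
print(float(posterior["outperform"]))    # 0.84
```

Observing gains raises the probability of an outperforming economy from the 60% prior to 84%, because gains are much more likely when the economy outperforms.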


EXAMPLE: Probability concepts and relationships 


A shipment of 1,000 cars has been unloaded into a parking area. The cars have the 
following features: 


• There are 600 blue (B) cars.

• Of the blue cars, 150 have driver assist (DA) technology.
• There are 400 red (R) cars.

• Of the red cars, 200 have DA technology.

Given these facts, calculate the following: 

1. Unconditional probabilities: P(B) and P(R) 

2. Conditional probabilities: P(DA|B) and P(DA|R) 

3. Joint probabilities: P(B and DA) and P(R and DA) 

4. Total probability rule: P(DA) 


5. Bayes’ rule: P(B|DA) 
Answer: 


Unconditional probabilities: 


P(B) = 600/1,000 = 60% 
P(R) = 400/1,000 = 40% 


Conditional probabilities: 


P(DA|B) = 150/600 = 25% 
P(DA|R) = 200/400 = 50% 


Joint probabilities: 


P(B and DA) = P(DA|B)P(B) = 25%(60%) = 15%; 15%(1,000) = 150 of the cars are 
blue with driver assist 


P(R and DA) = P(DA|R)P(R) = 50%(40%) = 20%; 20%(1,000) = 200 of the cars are 
red with driver assist 


Total probability rule: 


P(DA) = P(DA|B)P(B) + P(DA|R)P(R) = 25%(60%) + 50%(40%) = 35%;
35%(1,000) = 350 of the cars have driver assist


Bayes’ rule:


P(B|DA) = P(B and DA)/P(DA) = 15%/35% = 42.9%; 350 cars have driver assist
and of those cars, 150 are blue: 150/350 = 0.42857 = 42.9%


Independence: 


Now, assume we add to our information that 40% of the blue cars (240) are 
convertibles and 40% of the red cars (160) are convertibles, so that 400 of the cars 
are convertibles. In this case, P(B|C) = 240/400 = 60% = P(B) and P(R|C) = 160/400 
= 40% = P(R). This meets the requirement for independence that P(A|B) = P(A). The 
fact that a car chosen at random is a convertible gives us no additional information 
about whether a car is blue or red. 
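
The car example can also be worked directly from counts rather than percentages. In this sketch the counts of cars without driver assist (450 blue, 200 red) are derived from the totals in the example, and the helper name `count` is illustrative.

```python
from fractions import Fraction as F

# Car counts from the example, keyed by (color, has_driver_assist).
cars = {("B", True): 150, ("B", False): 450,
        ("R", True): 200, ("R", False): 200}
total = sum(cars.values())    # 1,000 cars

def count(color=None, da=None):
    """Number of cars matching the given color and/or driver-assist flag."""
    return sum(n for (c, d), n in cars.items()
               if (color is None or c == color) and (da is None or d == da))

p_b = F(count(color="B"), total)                        # unconditional: 3/5
p_da_given_b = F(count("B", True), count(color="B"))    # conditional: 1/4
p_b_and_da = F(count("B", True), total)                 # joint: 3/20
p_da = F(count(da=True), total)                         # total probability: 7/20
p_b_given_da = p_b_and_da / p_da                        # Bayes' rule: 3/7

# Independence check using the convertible counts from the example.
convertibles = {"B": 240, "R": 160}
p_b_given_conv = F(convertibles["B"], sum(convertibles.values()))
color_independent_of_convertible = (p_b_given_conv == p_b)

print(float(p_b_given_da), color_independent_of_convertible)
```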


MODULE QUIZ 12.2 
1. The probability function for the outcome of one roll of a six-sided die is given as P(X) 
= x/21. What is P(x > 4)? 
A. 16.6%. 
B. 23.8%. 
C. 33.3%. 
D. 52.4%. 


2. The relationship between the probability that both Event A and Event B will occur 
and the conditional probability of Event A given that Event B occurs is: 


A. P(AB) = P(A|B)P(B). 
_PIALB) 
P(AB)
= MAIB 
D. P(AB) = P(A|B)P(A). 

3. The probability that shares of Acme will increase in value over the next month is
50% and the probability that shares of Acme and shares of Best will both increase in
value over the next month is 40%. The probability that Best shares will increase in
value, given that Acme shares increase in value over the next month, is closest to:
A. 20%. 

B. 40%. 

C. 80%. 

D. 90%. 


KEY CONCEPTS 


LO 12.a 


An event is one of the possible outcomes or a subset of the possible outcomes of a
random variable, such as the result of a coin flip. The event space is the set of all
subsets of the possible outcomes, including the empty set (none of the possible outcomes).




LO 12.b 

Two events are independent if either of the following conditions holds:
• P(A) x P(B) = P(AB)

• P(A|B) = P(A)


Two events are mutually exclusive if the joint probability P(AB) = 0 (i.e., both cannot
occur). When two events are mutually exclusive, P(A or B) = P(A) + P(B).


LO 12.c 

If two events conditional on a third event are independent, we say they are 
conditionally independent. For example, if P(AB|C) = P(A|C) P(B|C), then A and B are 
conditionally independent. Two events may be independent but conditionally 
dependent, or vice versa. 


LO 12.d 
A probability function describes the probability for each possible outcome for a
discrete probability distribution. For example, P(x) = x/15, defined over the outcomes
{1, 2, 3, 4, 5}, is a probability function because the probabilities sum to 15/15 = 100%.


LO 12.e 

The joint probability of two events, P(AB), is the probability that they will both occur:
P(AB) = P(A|B) x P(B). This relationship can be rearranged to define the conditional
probability of A given B as follows:


$$P(A \mid B) = \frac{P(AB)}{P(B)}$$


LO 12.f 


An unconditional probability (i.e., marginal probability) is the probability of an event
occurring, regardless of the outcomes of other events.


A conditional probability, P(A|B), is the probability of an Event A occurring given that 
Event B has occurred. 


LO 12.g 
Bayes’ rule is:

$$P(A \mid B) = \frac{P(AB)}{P(B)}$$


This formula allows us to update the unconditional probability, P(A), based on the fact
that B has occurred. P(AB) can be calculated as P(B|A)P(A).


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 12.1 


1. C All of the outcomes and combinations specified are included in the event space for 
the random variable. (LO 12.a) 


2. B P(A or B) = P(A) + P(B) - P(AB) holds for both independent and dependent events.
The other equalities are only true for independent events. (LO 12.b)


3. C Two independent events may be conditionally independent or not conditionally
independent. (LO 12.c)


Module Quiz 12.2 


1. D The probability of x > 4 is the probability of an outcome of 5 or 6 (5/21 + 6/21 =
11/21 = 52.4%). (LO 12.d)


2. A The (joint) probability that both A and B will occur is equal to the conditional
probability of Event A given that Event B has occurred, multiplied by the 
unconditional probability of Event B. (LO 12.e) 


3. C Bayes’ formula tells us that: 

$$P(A|B) = \frac{P(AB)}{P(B)}$$

Applying that to the information given, we can write: 

$$P(\text{Best increases} \mid \text{Acme increases}) = \frac{P(\text{Best increases and Acme increases})}{P(\text{Acme increases})} = \frac{40\%}{50\%} = 80\%$$

(LO 12.g) 


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 2. 


READING 13 
RANDOM VARIABLES 


Study Session 4 


EXAM FOCUS 


This reading addresses the concepts of expected value, variance, skewness, and kurtosis. 
The characteristics and calculations of these measures will be discussed. For the exam, 
be able to distinguish among a probability mass function, a cumulative distribution 
function, and a probability density function. Also, be able to compute expected value, 
and be able to identify the four common population moments of a statistical 
distribution. 


MODULE 13.1: PROBABILITY MASS FUNCTIONS, 
CUMULATIVE DISTRIBUTION FUNCTIONS, AND 
EXPECTED VALUES 


Random Variables and Probability Functions 


LO 13.a: Describe and distinguish a probability mass function from a cumulative 
distribution function, and explain the relationship between these two. 


A discrete random variable is one that can take on only a countable number of 
possible outcomes. A discrete random variable that can take on only two possible 
values, zero and one, is referred to as a Bernoulli random variable. We can model the 
outcome of a coin flip as a Bernoulli random variable where heads = 1 and tails = 0. 
The number of days in June that will have a temperature greater than 70 degrees is 
also a discrete random variable. The possible outcomes are the integers from 0 to 30. 


A continuous random variable has an uncountable number of possible outcomes. The 
amount of rainfall that will fall in June is an example of a continuous random variable. 
There are an infinite number of possible outcomes because for any two values (e.g., 6.95 
inches and 6.94 inches), we can find a number between them [e.g., (6.95 + 6.94) / 2 = 
6.945]. Because there are an infinite number of possible outcomes, the probability of 
any single value is zero. For continuous random variables, we measure probabilities 
only over some positive interval (e.g., the probability that rainfall in June will be 
between 6.94 and 6.95 inches). 


A probability mass function (PMF), f(x) = P(X = x), gives us the probability that the 
outcome of a discrete random variable, X, will be equal to a given number, x. For a 
Bernoulli random variable for which P(X = 1) = p, the PMF is $f(x) = p^x (1 - p)^{1 - x}$. 
This yields P(X = 1) = p and P(X = 0) = 1 - p. 

A second example of a PMF is f(x) = 1/6, which is the probability that one roll of a 
six-sided die will take on one of the possible outcomes one through six. Each of the 
possible outcomes has the same probability of occurring (1/6 = 16.67%). 


A third example is the PMF f (x) = x/10 for a random variable that can take on values of 
1, 2, 3, or 4. For example, P(x = 3) = f (3) = 3/10 = 30%. 


For all of these PMFs, the sum of the probabilities of all of the possible outcomes is 
100%, a requirement for a PMF. 


A cumulative distribution function (CDF) gives us the probability that a random 
variable will take on a value less than or equal to x [i.e., $F(x) = P(X \le x)$]. 


For a Bernoulli random variable with possible outcomes of zero and one, the CDF is: 

$$F(x) = \begin{cases} 0 & x < 0 \\ 1 - p & 0 \le x < 1 \\ 1 & x \ge 1 \end{cases}$$

While the PMF for this Bernoulli variable is defined only for X = 0 or 1, the 
corresponding CDF is defined for all real numbers. For example, $P(X \le 0.1456)$ = 
F(0.1456) = 1 - p. 

For the roll of a six-sided die, the CDF is F(x) = x/6 for x = 1, 2, ..., 6, so that the 
probability of a roll of 3 or less is F(3) = 3/6 = 50%. This illustrates an important 
relationship between a PMF and its corresponding CDF: the probability of an outcome 
less than or equal to x is simply the sum of the probabilities of all the possible 
outcomes less than or equal to x. For the roll of a six-sided die, 
F(3) = f(1) + f(2) + f(3) = 1/6 + 1/6 + 1/6 = 3/6 = 50%. 
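
The relationship between a PMF and its CDF can be sketched in a few lines of Python. This is a minimal illustration using the fair-die PMF from the text; the helper name `cdf_from_pmf` is an assumption made for this example.

```python
from fractions import Fraction


def cdf_from_pmf(pmf, x):
    """F(x): sum of f(k) over all possible outcomes k that are <= x."""
    return sum(p for k, p in pmf.items() if k <= x)


# PMF for one roll of a fair six-sided die: f(x) = 1/6 for x = 1, ..., 6
die_pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# A valid PMF must sum to 100% over all possible outcomes
total_probability = sum(die_pmf.values())

print(total_probability)           # 1
print(cdf_from_pmf(die_pmf, 3))    # 1/2
print(cdf_from_pmf(die_pmf, 3.5))  # 1/2 (the CDF is defined between outcomes too)
```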


Expectations 


LO 13.b: Understand and apply the concept of a mathematical expectation of a 
random variable. 


The expected value is the weighted average of the possible outcomes of a random 
variable, where the weights are the probabilities that the outcomes will occur. The 
mathematical representation for the expected value of random variable X is: 


$$E(X) = \sum P(x_i) x_i = P(x_1) x_1 + P(x_2) x_2 + \cdots + P(x_n) x_n$$

Here, E is referred to as the expectations operator and is used to indicate the 
computation of a probability-weighted average. The symbol $x_1$ represents the first 
observed value (observation) for random variable X; $x_2$ is the second observation, and 
so on through the nth observation. The concept of expected value may be demonstrated 
using probabilities associated with a coin toss. On the flip of one coin, the occurrence of 
the event “heads” may be used to assign the value of one to a random variable. 
Alternatively, the event “tails” means the random variable equals zero. Statistically, we 
would formally write the following: 


if heads, then X = 1 
if tails, then X = 0 


For a fair coin, P(heads) = P(X = 1) = 0.5, and P(tails) = P(X = 0) = 0.5. The expected 
value can be computed as follows: 


$$E(X) = \sum P(x_i) x_i = P(X = 0)(0) + P(X = 1)(1) = (0.5)(0) + (0.5)(1) = 0.5$$


In any individual flip of a coin, X cannot assume a value of 0.5. Over the long term, 
however, the average of all the outcomes is expected to be 0.5. Similarly, the expected 
value of the roll of a fair die, where X = number that faces up on the die, is determined 
to be: 


$$E(X) = \sum P(x_i) x_i = (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6) = 3.5$$


We can never roll a 3.5 on a die, but over the long term, 3.5 should be the average value 
of all outcomes. 


The expected value is, statistically speaking, our best guess of the outcome of a random 
variable. While a 3.5 will never appear when a die is rolled, the average squared amount 
by which our guess differs from the actual outcomes is minimized when we use the 
expected value calculated this way. 


Note that the probabilities of the outcomes for a coin flip (0 or 1) and the probabilities 
of the outcomes for the roll of a die are equal for all of the possible outcomes in both 
cases. When outcomes are equally likely, the expected value is simply the mean 
(average) of the outcomes: 


$$\frac{1 + 0}{2} = 0.5 \text{ for a coin flip}$$

$$\frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5 \text{ for the roll of a die}$$

When we estimate the expected value of a random variable based on n observations, we 
use the mean of the observed values as our estimate of the mean of the underlying 
probability distribution. In terms of a probability model, we are assuming that the 
outcomes are equally likely; that is, each has a probability of 1/n. Multiplying each 
outcome by 1/n and then summing the results produces the same expected value as 
dividing the sum of the outcomes by n. 


In other cases, the probabilities of the outcomes are not equal and we calculate the 
expected value as the weighted sum of the outcomes, where the weights are the 
probabilities of each outcome. The following example illustrates such a case. 


EXAMPLE: Expected earnings per share (EPS) 


The probability distribution of EPS for Ron’s Stores is given in the following figure. 
Calculate the expected earnings per share. 


EPS Probability Distribution 

Probability    EPS 
    10%        £1.80 
    20%        £1.60 
    40%        £1.20 
    30%        £1.00 
   100% 

Answer: 


The expected EPS is simply a weighted average of each possible EPS, where the 
weights are the probabilities of each possible outcome. 


E(EPS) = 0.10(1.80) + 0.20(1.60) + 0.40(1.20) + 0.30(1.00) = £1.28 
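
The same probability-weighted average can be checked with a short Python sketch. The helper `expected_value` is a hypothetical name introduced for this example; the probabilities and EPS outcomes are those given in the figure above.

```python
def expected_value(distribution):
    """Probability-weighted average of (probability, outcome) pairs."""
    total_probability = sum(p for p, _ in distribution)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 100%")
    return sum(p * x for p, x in distribution)


# (probability, EPS) pairs from the Ron's Stores example
eps_distribution = [(0.10, 1.80), (0.20, 1.60), (0.40, 1.20), (0.30, 1.00)]
expected_eps = expected_value(eps_distribution)
print(round(expected_eps, 2))  # 1.28
```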


The following are two useful properties of expected values: 


1. If c is any constant, then: 


E(cX) = cE(X) 
2. If X and Y are any random variables, then: 


E(X + Y) = E(X) + E(Y) 


MODULE QUIZ 13.1 

1. The probability mass function (PMF) for a discrete random variable that can take on 
the values 1, 2, 3, 4, or 5 is P(X = x) = x/15. The value of the cumulative distribution 
function (CDF) of 4, F(4), is equal to: 

A. 26.7%. 

B. 40.0%. 

C. 66.7%. 

D. 75.0%. 


2. An analyst has estimated the following probabilities for gross domestic product 
growth next year: 
P(4%) = 10%, P(3%) = 30%, P(2%) = 40%, P(1%) = 20% 
Based on these estimates, the expected value of GDP growth next year is: 
A. 2.0%. 
B. 2.3%. 
C. 2.5%. 


D. 2.8%. 


MODULE 13.2: MEAN, VARIANCE, SKEWNESS, AND 
KURTOSIS 


LO 13.c: Describe the four common population moments. 


The population moments most often used are: 

■ mean; 
■ variance; 
■ skewness; and 
■ kurtosis. 


The first moment, the mean of a random variable, is its expected value, E(X), which we 
discussed previously. The mean can be represented by the Greek letter $\mu$ (mu). 


The other three moments are central moments because the functions involve the 
random variable minus its mean, $X - \mu$. Subtracting the mean produces functions that 
are unaffected by the location of the mean. These moments give us information about 
the shape of a probability distribution around its mean. 


PROFESSOR'S NOTE 
Since central moments are measured relative to the mean, the first central 
moment equals zero and is, therefore, not typically used. 


The second central moment of a random variable is its variance, $\sigma^2$. Variance is defined 
as: 

$$\sigma^2 = E\{[X - E(X)]^2\} = E[(X - \mu)^2]$$

Squaring the deviations from the mean ensures that $\sigma^2$ is never negative. Variance gives us 
information about how widely dispersed the values of the random variable are around 
the mean. 


We often use the square root of variance, the standard deviation $\sigma$, as a measure of 
dispersion because it has the same units as the random variable. If our distribution is 
for percentage rates of return, the standard deviation is also measured in terms of 
percentage returns. 


The third central moment of a distribution is: 

$$E\{[X - E(X)]^3\} = E[(X - \mu)^3]$$

Skewness, a measure of a distribution’s symmetry, is the standardized third moment. 
We standardize it by dividing it by the standard deviation cubed. 

$$\text{skewness} = \frac{E[(X - \mu)^3]}{\sigma^3}$$


Because we both subtract the mean and divide by standard deviation cubed, skewness 
is unaffected by differences in the mean or in the variance of the random variable. This 
allows us to compare skewness of two different distributions directly. A distribution 
with skew = 0 is perfectly symmetric. 


The fourth central moment of a distribution is: 

$$E\{[X - E(X)]^4\} = E[(X - \mu)^4]$$

Kurtosis is the standardized fourth moment. 

$$\text{kurtosis} = \frac{E[(X - \mu)^4]}{\sigma^4}$$


Kurtosis is a measure of the shape of a distribution, in particular the total probability in 
the tails of the distribution relative to the probability in the rest of the distribution. The 
higher the kurtosis, the greater the probability in the tails of the distribution. We 
sometimes refer to distributions with high kurtosis as fat-tailed distributions. 
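
As a rough illustration, the four moments can be computed directly from a PMF. The sketch below assumes the PMF f(x) = x/10 over the outcomes 1, 2, 3, and 4 used earlier in this reading; the function name `moments` is hypothetical.

```python
def moments(pmf):
    """Return (mean, variance, skewness, kurtosis) for a PMF {outcome: probability}."""
    mean = sum(p * x for x, p in pmf.items())

    def central(k):
        # k-th central moment: E[(X - mean)^k]
        return sum(p * (x - mean) ** k for x, p in pmf.items())

    variance = central(2)
    sigma = variance ** 0.5
    skewness = central(3) / sigma ** 3   # standardized third central moment
    kurtosis = central(4) / sigma ** 4   # standardized fourth central moment
    return mean, variance, skewness, kurtosis


pmf = {x: x / 10 for x in (1, 2, 3, 4)}
mean, variance, skewness, kurtosis = moments(pmf)
print(round(mean, 4), round(variance, 4), round(skewness, 4), round(kurtosis, 4))
# 3.0 1.0 -0.6 2.2
```

The negative skewness reflects the longer tail toward the low outcomes, since most of the probability sits on 3 and 4.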


The following figures illustrate the concepts of skewness and kurtosis for a probability 
distribution. 


Figure 13.1: Skewness 

Figure 13.2: Kurtosis 


MODULE QUIZ 13.2 


1. For two financial securities with distributions of returns that differ only in their 
kurtosis, the one with the higher kurtosis will have: 
A. a wider dispersion of returns around the mean. 
B. a greater probability of extreme positive and negative returns. 
C. a less peaked distribution of returns. 
D. a more uniform distribution. 


MODULE 13.3: PROBABILITY DENSITY FUNCTIONS, 
QUANTILES, AND LINEAR TRANSFORMATIONS 


Probability Density Functions 


LO 13.d: Explain the differences between a probability mass function and a 
probability density function. 


Recall that we used a PMF to describe the probabilities of the possible outcomes for a 
discrete random variable. A simple example is P(X = x) = f(x) = x/10 for the possible 
outcomes 1, 2, 3, and 4. The PMF tells us the probability of each of those possible 
outcomes [e.g., P(X = 4) = 4/10 = 40%]. 


Recall that a continuous random variable can take on any of an infinite number of 
possible outcomes so that the probability of any single outcome is zero. We describe a 
continuous distribution function with a probability density function (PDF), rather 
than a PMF. A PDF allows us to calculate the probability of an outcome between two 
values (over an interval). This probability is the area under the PDF over the interval. 
Mathematically, we take the integral of the PDF over an interval to calculate the 
probability that the random variable will take on a value in that interval. 
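
The idea that a probability is the area under the PDF can be sketched numerically. The density f(x) = 2x on [0, 1] below is a hypothetical example (not from the reading), and the midpoint rule is just one simple way to approximate the integral.

```python
def probability_over_interval(pdf, lower, upper, steps=10_000):
    """Approximate the integral of pdf from lower to upper with the midpoint rule."""
    width = (upper - lower) / steps
    return sum(pdf(lower + (i + 0.5) * width) for i in range(steps)) * width


def pdf(x):
    # Hypothetical density: f(x) = 2x on [0, 1], zero elsewhere
    return 2 * x if 0 <= x <= 1 else 0.0


p_interval = probability_over_interval(pdf, 0.2, 0.5)   # P(0.2 <= X <= 0.5)
p_total = probability_over_interval(pdf, 0.0, 1.0)      # total area under the PDF
p_point = probability_over_interval(pdf, 0.3, 0.3)      # a single value has zero width

print(round(p_interval, 6), round(p_total, 6), p_point)  # 0.21 1.0 0.0
```

The exact answer for the interval is 0.5² - 0.2² = 0.21, and the zero result for a single point matches the statement that the probability of any single outcome is zero.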


Quantile Functions 


LO 13.e: Characterize the quantile function and quantile-based estimators. 


The quantile function, $Q(\alpha)$, is the inverse of the CDF. Recall that a CDF gives us the 
probability that a random variable will be less than or equal to some value X = x. The 
interpretation of the CDF is the same for discrete and continuous random variables. 


Consider a CDF that gives us a probability of 30% that a continuous random variable 
takes on values less than 2 [i.e., P(X < 2) = F(2) = 30%]. The quantile function, Q(30%), 
for this distribution would return the value 2; 30% of the outcomes are expected to be 
less than 2. A common use of quantiles is to report the results of standardized tests. 
Consider a student with a score of 122 on an exam. If the student’s quantile score is 
74%, this indicates that the student’s score of 122 was higher than 74% of those 
who took the test. The quantile function, Q(74%), would return the student’s score of 
122. 


Two quantile measures are of particular interest to us here. One is the value of the 
quantile function for 50%. This is termed the median of the distribution. On average, 
50% of the variable’s outcomes will be below the median and 50% of the variable’s 
outcomes will be above the median. For a symmetric distribution (skew = 0), the mean 
and median will be equal. For a distribution with positive (right) skew, the median will 
be less than the mean, but will be greater than the mean for distributions with negative 
(left) skew. 


The second quantile measure of interest here is the interquartile range (IQR). The 
interquartile range is the upper and lower value of the outcomes of a random variable 
that include the middle 50% of its probability distribution. The lower value is Q(25%) 
and the upper value is Q(75%). The lower value is the value that we expect 25% of the 
outcomes to be less than, and the upper value is the value that we expect 75% of the 
values to be less than. Like standard deviation, the interquartile range is a measure of 
the variability of a random variable. Compared to a given distribution, the outcomes of 
a distribution with a lower interquartile range are more concentrated around the mean, 
just as they are for a distribution with a lower standard deviation. 
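
The inverse relationship between the quantile function and the CDF can be illustrated with Python's standard library. The sketch assumes a normally distributed variable with a hypothetical mean of 100 and standard deviation of 15; `NormalDist.inv_cdf` plays the role of Q and `NormalDist.cdf` the role of F.

```python
from statistics import NormalDist

# Hypothetical distribution chosen only for illustration
scores = NormalDist(mu=100, sigma=15)


def quantile(alpha):
    """Q(alpha): the value that a fraction alpha of outcomes fall below."""
    return scores.inv_cdf(alpha)


median = quantile(0.50)
lower, upper = quantile(0.25), quantile(0.75)   # endpoints of the interquartile range
iqr_width = upper - lower
round_trip = scores.cdf(quantile(0.23))         # F[Q(23%)] should return 23%

print(round(median, 2), round(lower, 2), round(upper, 2), round(round_trip, 4))
```

Because this distribution is symmetric, the median equals the mean of 100, consistent with the statement above for distributions with skew = 0.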


Linear Transformations of Random Variables 


LO 13.f: Explain the effect of a linear transformation of a random variable on the 
mean, variance, standard deviation, skewness, kurtosis, median, and 
interquartile range. 


A linear transformation of a random variable, X, takes the form Y = a + bX, where a 
and b are constants. The constant a shifts the location of the random variable, X, and b 
rescales the values of X. The relationships between the moments of the distribution of X 
and the moments of the distribution of Y, a linear transformation of X, are as follows: 


■ The mean of Y can be calculated as E(Y) = a + bE(X); both the location and the scale 
are affected. 


■ The variance of Y can be calculated as $\sigma_Y^2 = b^2 \sigma_X^2$; while a shifts the location of the 
distribution, it does not affect the dispersion around the mean, which is rescaled by b. 
The standard deviation of Y is simply $\sigma_Y = \sqrt{b^2 \sigma_X^2} = |b| \sigma_X$. 


■ With b > 0 (an increasing transformation), the skew is unaffected, skew Y = skew X. 


■ With b < 0 (a decreasing transformation), the magnitude of the skew is unaffected, 
but the sign is changed, skew Y = -skew X. 


■ A linear transformation of X does not affect kurtosis, kurtosis Y = kurtosis X. 
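
These relationships can be verified numerically. The sketch below reuses the PMF f(x) = x/10 over 1, 2, 3, and 4 from earlier in the reading and applies the hypothetical transformation Y = 2 - 3X (a = 2, b = -3); the helper names are assumptions made for this example.

```python
def moments(pmf):
    """Return (mean, variance, skewness, kurtosis) for a PMF {outcome: probability}."""
    mean = sum(p * x for x, p in pmf.items())

    def central(k):
        return sum(p * (x - mean) ** k for x, p in pmf.items())

    var = central(2)
    return mean, var, central(3) / var ** 1.5, central(4) / var ** 2


def linear_transform(pmf, a, b):
    """PMF of Y = a + bX; b must be nonzero so that outcomes stay distinct."""
    return {a + b * x: p for x, p in pmf.items()}


a, b = 2, -3
pmf_x = {x: x / 10 for x in (1, 2, 3, 4)}
mean_x, var_x, skew_x, kurt_x = moments(pmf_x)
mean_y, var_y, skew_y, kurt_y = moments(linear_transform(pmf_x, a, b))

print(round(mean_y, 4), round(a + b * mean_x, 4))   # mean: E(Y) = a + bE(X)
print(round(var_y, 4), round(b ** 2 * var_x, 4))    # variance scales by b squared
print(round(skew_y, 4), round(-skew_x, 4))          # b < 0 flips the sign of skew
print(round(kurt_y, 4), round(kurt_x, 4))           # kurtosis is unchanged
```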


MODULE QUIZ 13.3 


1. Which of the following regarding a probability density function (PDF) is correct? A 
PDF: 


A. provides the probability of each of the possible outcomes of a random variable. 
B. can provide the same information as a cumulative distribution function (CDF). 
C. describes the probabilities for any random variable. 
D. only applies to a discrete probability distribution. 

2. For the quantile function, Q(x): 


A. the CDF function F[Q(23%)] = 23%. 

B. Q(23%) will identify the largest 23% of all possible outcomes. 
C. Q(50%) is the interquartile range. 

D. x can only take on integer values. 


3. For a random variable, X, the variance of Y = a + bX is: 
A. $a^2 + b^2 \sigma_X^2$. 
B. $b \sigma_X^2$. 
C. $b^2 \sigma_X^2$. 
D. $a + b^2 \sigma_X^2$. 


KEY CONCEPTS 


LO 13.a 
A probability mass function (PMF), f (x), gives us the probability that a discrete random 
variable will take on the value x. 


A cumulative distribution function (CDF), F(x), gives us the probability that a random 
variable X will take on a value less than or equal to x. 


LO 13.b 


The expected value of a discrete random variable is the probability-weighted average of 
the possible outcomes (i.e., the mean of the distribution). 


LO 13.c 

Four commonly used moments of a random variable are its mean, variance (standard 
deviation), skewness, and kurtosis. The mean is the expected value of the random 
variable, variance is a measure of dispersion, skewness is a measure of symmetry, and 
kurtosis is a measure of the proportion of the outcomes in the tails of the distribution. 


LO 13.d 

A PMF provides the probability that a discrete random variable will take on a given 
value. A PDF provides the probability that the outcome for a continuous random 
variable will be within a given interval. 


LO 13.e 

A quantile is the percentage of outcomes less than a given outcome. A quantile function, 
Q(x%), provides the value of an outcome which is greater than x% of all possible 
outcomes. Q(50%) is the median of a distribution. 50% of the outcomes are greater 
than the median and 50% of the outcomes are less than the median. The interquartile 
range is an interval that includes the central 50% of all possible outcomes. 


LO 13.f 

For a variable Y = a + bX (a linear transformation of X): 

■ the mean of Y is E(Y) = a + bE(X); 
■ the variance of Y is $\sigma_Y^2 = b^2 \sigma_X^2$ and the standard deviation is $\sigma_Y = |b| \sigma_X$; 
■ the skew of Y = skew X for b > 0, and skew Y = -skew X for b < 0; and 
■ the kurtosis of Y = kurtosis X. 


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 13.1 


1. C F(4) is the probability that the random variable will take on a value of 4 or less. 
We can calculate $P(X \le 4)$ as 1/15 + 2/15 + 3/15 + 4/15 = 66.7%, or by 
subtracting 5/15, P(X = 5), from 100% to get 66.7%. (LO 13.a) 


2. B The expected value is computed as: (4)(10%) + (3)(30%) + (2)(40%) + (1)(20%) = 
2.3%. (LO 13.b) 


Module Quiz 13.2 


1. B High kurtosis indicates that the probability in the tails (extreme outcomes) is 
greater (i.e., the distribution will have fatter tails). (LO 13.c) 


Module Quiz 13.3 


1. B A PDF integrated from minus infinity to a given value gives the probability of 
an outcome less than the given value; the same information is provided by a CDF. 
A PDF provides the probabilities only for a continuous random variable. The 
probability that a continuous random variable will take on a given value is zero. 
(LO 13.d) 


2. A Q(23%) gives us a value that is greater than 23% of all outcomes and the CDF for 
that value is the probability of an outcome less than that value (i.e., 23%). (LO 13.e) 


3. C The variance of Y is $b^2 \sigma_X^2$, where $\sigma_X^2$ is the variance of X. (LO 13.f) 


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 3. 


READING 14 


COMMON UNIVARIATE RANDOM 
VARIABLES 


Study Session 4 


EXAM FOCUS 


This reading explores the following common probability distributions: uniform, 
Bernoulli, binomial, Poisson, normal, lognormal, chi-squared, Student’s t-, F-, 
exponential, and beta. You will learn the properties, parameters, and common 
occurrences of these distributions. For the exam, focus most of your attention on the 
binomial, normal, and Student’s t-distributions. Also, know how to standardize a 
normally distributed random variable, how to use a z-table, and how to construct 
confidence intervals. 


LO 14.a: Distinguish the key properties and identify the common occurrences of 
the following distributions: uniform distribution, Bernoulli distribution, 
binomial distribution, Poisson distribution, normal distribution, lognormal 
distribution, Chi-squared distribution, Student’s t and F-distributions. 


MODULE 14.1: UNIFORM, BERNOULLI, BINOMIAL, 
AND POISSON DISTRIBUTIONS 


The Uniform Distribution 


The continuous uniform distribution is defined over a range that spans between 
some lower limit, a, and some upper limit, b, which serve as the parameters of the 
distribution. Outcomes can only occur between a and b, and because we are dealing 
with a continuous distribution, even if a < x < b, P(X = x) = 0. Formally, the properties of 
a continuous uniform distribution may be described as follows. 


For all $a \le x_1 < x_2 \le b$ (i.e., for all $x_1$ and $x_2$ between the boundaries a and b): 


■ P(X < a or X > b) = 0 (i.e., the probability of X outside the boundaries is zero). 
■ $P(x_1 \le X \le x_2) = (x_2 - x_1) / (b - a)$. This defines the probability of outcomes between 
$x_1$ and $x_2$. 


Don't miss how simple this is just because the notation is so mathematical. For a 
continuous uniform distribution, the probability of outcomes in a range that is one-half 
the whole range is 50%. The probability of outcomes in a range that is one-quarter of 
the possible range is 25%. 


EXAMPLE: Continuous uniform distribution 


X is uniformly distributed between 2 and 12. Calculate the probability that X will be 
between 4 and 8. 


Answer: 
$$P(4 \le X \le 8) = \frac{8 - 4}{12 - 2} = \frac{4}{10} = 40\%$$


The following figure illustrates this continuous uniform distribution. Note that the 
area bounded by 4 and 8 is 40% of the total probability between 2 and 12 (which is 
100%). 
Continuous Uniform Distribution 

The cumulative distribution function (CDF) is linear over the variable’s range. The 
CDF for the distribution in the previous example, P(X < x), is shown in Figure 14.1. 


Figure 14.1: CDF for a Continuous Uniform Variable 

The probability density function (PDF) for a continuous uniform distribution is 
expressed as: 

$$f(x) = \frac{1}{b - a} \text{ for } a \le x \le b, \text{ else } f(x) = 0$$


The mean and variance, respectively, of a uniform distribution are: 


$$E(X) = \frac{a + b}{2}$$

$$\text{Var}(X) = \frac{(b - a)^2}{12}$$
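
A brief Python sketch of these uniform-distribution results follows, using the example of X uniformly distributed between 2 and 12. The function names are hypothetical, and the mean and variance use the standard formulas (a + b)/2 and (b - a)²/12.

```python
def uniform_probability(a, b, x1, x2):
    """P(x1 <= X <= x2) for X uniform on [a, b]; the interval is clipped to [a, b]."""
    lo, hi = max(a, x1), min(b, x2)
    return max(hi - lo, 0) / (b - a)


def uniform_mean(a, b):
    return (a + b) / 2


def uniform_variance(a, b):
    return (b - a) ** 2 / 12


print(uniform_probability(2, 12, 4, 8))     # 0.4
print(uniform_mean(2, 12))                  # 7.0
print(round(uniform_variance(2, 12), 4))    # 8.3333
```

Clipping the interval to [a, b] reflects the property that the probability of X outside the boundaries is zero.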


The Bernoulli Distribution 


A Bernoulli random variable has only two possible outcomes. The outcomes can be 
defined as either a success or a failure. A success, which occurs with probability p, may 
be denoted with the value 1, and a failure, which occurs with probability 1 - p, may be 
denoted with the value 0. Bernoulli distributed random variables are commonly used 
for assessing the probability of binary outcomes, such as the probability that a firm 
will default on its debt over some interval. 


For a Bernoulli random variable for which P(X = 1) = p, the probability mass 
function is $f(x) = p^x (1 - p)^{1 - x}$. This yields P(X = 1) = p and P(X = 0) = 1 - p. 


For a Bernoulli random variable, $\mu_X = p$ and the variance is given by Var(X) = p(1 - p). 


Note that the variance is low for values of p close to 1 or 0, and the maximum variance 
is for p = 0.5. 


For a Bernoulli random variable with possible outcomes 0 and 1, the CDF is: 

$$F(x) = \begin{cases} 0 & x < 0 \\ 1 - p & 0 \le x < 1 \\ 1 & x \ge 1 \end{cases}$$

Note that while the probability mass function (PMF) for this Bernoulli variable is 
defined only for X = 0 or 1, the corresponding CDF is defined for all real numbers. 


The Binomial Distribution 


A binomial random variable may be defined as the number of successes in a given 
number of Bernoulli trials, whereby the outcome can be either success or failure. The 
probability of success, p, is constant for each trial and the trials are independent. Under 
these conditions, the binomial probability function defines the probability of exactly x 
successes in n trials. It can be expressed using the following formula: 


$$p(x) = P(X = x) = (\text{number of ways to choose } x \text{ from } n) \, p^x (1 - p)^{n - x}$$

where: 

$$(\text{number of ways to choose } x \text{ from } n) = \frac{n!}{(n - x)! \, x!}$$

So, the probability of exactly x successes in n trials is: 

$$p(x) = \frac{n!}{(n - x)! \, x!} \, p^x (1 - p)^{n - x}$$


EXAMPLE: Binomial probability 


Assuming a binomial distribution, compute the probability of drawing three black 
beans from a bowl of black and white beans if the probability of selecting a black 
bean in any given attempt is 0.6. You will draw five beans from the bowl. 


Answer: 
$$P(X = 3) = p(3) = \frac{5!}{2! \, 3!} (0.6)^3 (0.4)^2 = (120/12)(0.216)(0.160) = 0.3456$$


Some intuition about these results may help you remember the calculations. Consider 
that a (very large) bowl of black and white beans has 60% black beans and each time 
you select a bean, you replace it in the bowl before drawing again. We want to know the 
probability of selecting exactly three black beans in five draws, as in the previous 
example. 


One way this might happen is BBBWW. Because the draws are independent, the 
probability of this is easy to calculate. The probability of drawing a black bean is 60%, 
and the probability of drawing a white bean is 1 - 60% = 40%. Therefore, the 
probability of selecting BBBWW, in order, is 0.6 × 0.6 × 0.6 × 0.4 × 0.4 = 3.456%. This is 
the $p^3 (1 - p)^2$ from the formula, where p is 60%, the probability of selecting a black bean 
on any single draw from the bowl. BBBWW is not, however, the only way to choose 
exactly three black beans in five trials. Another possibility is BBWWB, and a third is 
BWWBB. Each of these will have exactly the same probability of occurring as our initial 
outcome, BBBWW. That’s why we need to answer the question of how many ways 
(different orders) there are for us to choose three black beans in five draws. Using the 
formula, there are $\frac{5!}{(5 - 3)! \, 3!} = 10$ ways; 10 × 3.456% = 34.56%, the answer we computed 
previously. 
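
The bean example can be reproduced with a minimal Python sketch. `math.comb` gives the number of ways to choose x from n; the function name `binomial_pmf` is an assumption made for this illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


one_sequence = 0.6 ** 3 * 0.4 ** 2   # probability of BBBWW in that exact order
orderings = comb(5, 3)               # number of orders with three black beans
probability = binomial_pmf(3, 5, 0.6)

print(orderings)                     # 10
print(round(one_sequence, 5))        # 0.03456
print(round(probability, 4))         # 0.3456
```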


For a given series of n trials, the expected number of successes, or E(X), is given by the 
following formula: 


expected value of X = E(X) = np 


The intuition is straightforward; if we perform n trials and the probability of success on 
each trial is p, we expect np successes. 


The variance of a binomial random variable is given by: 


variance of X = np(1 - p) 


EXAMPLE: Expected value of a binomial random variable 


Based on empirical data, the probability that the Dow Jones Industrial Average 
(DJIA) will increase on any given day has been determined to equal 0.67. Assuming 
the only other outcome is that it decreases, we can state p(UP) = 0.67 and p(DOWN) 
= 0.33. Further, assume that movements in the DJIA are independent (i.e., an increase 
in one day is independent of what happened on another day). 


Using the information provided, compute the expected value of the number of up 
days in a five-day period. 


Answer: 


Using binomial terminology, we define success as UP, so p = 0.67. Note that the 
definition of success is critical to any binomial problem. 


E(X|n = 5, p = 0.67) = (5)(0.67) = 3.35 


Recall that the “|” symbol means given. Hence, the preceding statement is read as: 
the expected value of X given that n = 5, and the probability of success = 67% is 3.35. 


Using the equation for the variance of a binomial distribution, we find the variance 
of X to be: 


Var(X) = np(1 - p) = 5(0.67)(0.33) = 1.106 


We should note that because the binomial distribution is a discrete distribution, the 
result X = 3.35 is not possible. However, if we were to record the results of many 
five-day periods, the average number of up days (successes) would converge to 3.35. 
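
As a cross-check on the formulas E(X) = np and Var(X) = np(1 - p), the sketch below computes the mean and variance directly from the binomial probabilities for the DJIA example (n = 5, p = 0.67). It is an illustrative calculation, not a required exam technique.

```python
from math import comb

n, p = 5, 0.67

# Binomial probabilities for 0, 1, ..., n up days
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

# Probability-weighted mean and variance computed from the PMF itself
mean = sum(x * f for x, f in enumerate(pmf))
variance = sum((x - mean) ** 2 * f for x, f in enumerate(pmf))

print(round(mean, 4), round(n * p, 4))                   # both 3.35
print(round(variance, 4), round(n * p * (1 - p), 4))     # both 1.1055
```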


Binomial distributions are used extensively in the investment world where outcomes 
are typically seen as successes or failures. In general, if the price of a security goes up, it 
is viewed as a success. If the price of a security goes down, it is a failure. In this context, 
binomial distributions are often used to create models to aid in the process of asset 
valuation. 


PROFESSOR'S NOTE 
We will examine binomial trees for stock option valuation in Book 4. 


The Poisson Distribution 


The Poisson distribution is a discrete probability distribution with a number of real- 
world applications. For example, the number of defects per batch in a production 
process or the number of 911 calls per hour are discrete random variables that follow a 
Poisson distribution. 


While the Poisson random variable X refers to the number of successes per unit, the 
parameter lambda ($\lambda$) refers to the average or expected number of successes per unit. 
The mathematical expression for the Poisson distribution for obtaining x successes, 
given that $\lambda$ successes are expected, is: 

$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

An interesting feature of the Poisson distribution is that both its mean and variance are 
equal to the parameter, $\lambda$. 


EXAMPLE: Using the Poisson distribution (1) 


On average, the 911 emergency switchboards receive 0.1 incoming calls per second. 
Assuming the arrival of calls follows a Poisson distribution, what is the probability 
that in a given minute exactly five phone calls will be received? 


Answer: 


We first need to convert the seconds into minutes. Note that $\lambda$, the expected number 
of calls per minute, is (0.1)(60) = 6.0. Hence: 

$$P(X = 5) = \frac{6^5 e^{-6}}{5!} = 0.1606 = 16.06\%$$


This means that, given the average of 0.1 incoming calls per second, there is a 16.06% 
chance there will be five incoming phone calls in a minute. 


EXAMPLE: Using the Poisson distribution (2) 


Assume there is a 0.01 probability of a patient experiencing severe weight loss as a 
side effect from taking a recently approved drug used to treat heart disease. What is 
the probability that out of 200 such procedures conducted on different patients, five 
patients will develop this complication? Assume that the number of patients 
developing the complication from the procedure is Poisson distributed. 


Answer: 


Let $\lambda$ = expected number of patients developing the complication from the procedure 
= np = (200)(0.01) = 2 

$$P(X = 5) = \frac{\lambda^5 e^{-\lambda}}{5!} = \frac{2^5 e^{-2}}{5!} = 0.036 = 3.6\%$$


This means that given a complication rate of 0.01, there is a 3.6% probability that exactly 5 
out of 200 patients will experience severe weight loss from taking the drug. 
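
Both Poisson examples can be reproduced with a short Python sketch. The function name `poisson_pmf` is hypothetical; the inputs are the expected counts derived in the two examples (6 calls per minute and 2 patients per 200).

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with expected count lam."""
    return lam ** x * exp(-lam) / factorial(x)


calls = poisson_pmf(5, 0.1 * 60)       # five calls in a minute, lambda = 6
patients = poisson_pmf(5, 200 * 0.01)  # five patients out of 200, lambda = 2

print(round(calls, 4))     # 0.1606
print(round(patients, 3))  # 0.036
```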


MODULE QUIZ 14.1 


1. If 5% of the cars coming off the assembly line have some defect in them, what is the 
probability that out of three cars chosen at random, exactly one car will be 
defective? Assume that the number of defective cars has a Poisson distribution. 

A. 0.129. 
B. 0.135. 
C. 0.151. 
D. 0.174. 


2. A recent study indicated that 60% of all businesses have a web page. Assuming a 
binomial probability distribution, what is the probability that exactly four businesses 
will have a web page in a random sample of six businesses? 

A. 0.138. 
B. 0.276. 
C. 0.311. 
D. 0.324. 


3. What is the probability of an outcome being between 15 and 25 for a random variable 
that follows a continuous uniform distribution within the range of 12 to 28? 
A. 0.509. 
B. 0.625. 
C. 1.000. 


D. 1.600. 


MODULE 14.2: NORMAL AND LOGNORMAL 
DISTRIBUTIONS 


The Normal Distribution 


The normal distribution is important for many reasons. Many of the random variables 
that are relevant to finance and other professional disciplines follow a normal 
distribution. In the area of investment and portfolio management, the normal 
distribution plays a central role in portfolio theory. 


The PDF for the normal distribution is: 


$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$


The normal distribution has the following key properties: 


■ It is completely described by its mean, μ, and variance, σ², stated as X ~ N(μ, σ²). In
words, this says, “X is normally distributed with mean μ and variance σ².”

■ Skewness = 0, meaning the normal distribution is symmetric about its mean, so that
P(X < μ) = P(μ < X) = 0.5, and mean = median = mode.

■ Kurtosis = 3; this is a measure of how the distribution is spread out with an emphasis
on the tails of the distribution. Excess kurtosis is measured relative to 3, the kurtosis
of the normal distribution.

■ A linear combination of normally distributed independent random variables is also
normally distributed.

■ The probabilities of outcomes further above and below the mean get smaller and
smaller but do not go to zero (the tails get very thin but extend infinitely).


Many of these properties are evident from examining the graph of a normal 
distribution’s PDF as illustrated in Figure 14.2. 


Figure 14.2: Normal Distribution PDF 


A confidence interval is a range of values around the expected outcome within which
we expect the actual outcome to be some specified percentage of the time. A 95% 
confidence interval is a range that we expect the random variable to be in 95% of the 
time. For a normal distribution, this interval is based on the expected value (sometimes 
called a point estimate) of the random variable and on its variability, which we 
measure with standard deviation. 


Confidence intervals for a normal distribution are illustrated in Figure 14.3. For any
normally distributed random variable, approximately 68% of the outcomes are within one
standard deviation of the expected value (mean), and approximately 95% of the outcomes are
within two standard deviations of the expected value.


Figure 14.3: Confidence Intervals for a Normal Distribution 


In practice, we will not know the actual values for the mean and standard deviation of
the distribution, but will have estimated them as X̄ and s. The three confidence intervals
of most interest are given by the following:


■ The 90% confidence interval for X is X̄ - 1.65s to X̄ + 1.65s.
■ The 95% confidence interval for X is X̄ - 1.96s to X̄ + 1.96s.
■ The 99% confidence interval for X is X̄ - 2.58s to X̄ + 2.58s.


EXAMPLE: Confidence intervals 


The average return of a mutual fund is 10.5% per year and the standard deviation of 
annual returns is 18%. If returns are approximately normal, what is the 95% 
confidence interval for the mutual fund return next year? 


Answer: 


Here μ and σ are 10.5% and 18%, respectively. Thus, the 95% confidence interval for
the return, R, is:


10.5 ± 1.96(18) = -24.78% to 45.78%


Symbolically, this result can be expressed as:


P(-24.78 < R < 45.78) = 0.95 or 95% 


The interpretation is that the annual return is expected to be within this interval 
95% of the time, or 95 out of 100 years. 
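The interval arithmetic above can be sketched in Python as follows. The helper name `normal_interval` is hypothetical and simply applies "mean plus or minus z standard deviations" with the reliability factors listed earlier (1.65, 1.96, 2.58).

```python
def normal_interval(mean: float, std_dev: float, z: float) -> tuple[float, float]:
    """Return (lower, upper) bounds: mean -/+ z standard deviations."""
    return mean - z * std_dev, mean + z * std_dev


# Mutual fund example: mean 10.5%, standard deviation 18%, 95% interval
low, high = normal_interval(10.5, 18.0, 1.96)
print(f"{low:.2f}% to {high:.2f}%")  # -24.78% to 45.78%
```

Swapping 1.96 for 1.65 or 2.58 gives the 90% and 99% intervals for the same fund.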


The Standard Normal Distribution 

A standard normal distribution (i.e., z-distribution) is a normal distribution that has
been standardized so it has a mean of zero and a standard deviation of 1 [i.e., Z ~ N(0,1)].
To standardize an observation from a given normal distribution, the z-value of the 
observation must be calculated. The z-value represents the number of standard 
deviations a given observation is from the population mean. Standardization is the 
process of converting an observed value for a random variable to its z-value. The 
following formula is used to standardize a random variable: 


$$z = \frac{\text{observation} - \text{population mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$$


PROFESSOR'S NOTE
The term z-value will be used for a standardized observation in this reading.
The terms z-score and z-statistic are also commonly used.


EXAMPLE: Standardizing a random variable (calculating z-values) 


Assume the annual earnings per share (EPS) for a population of firms are normally 
distributed with a mean of $6 and a standard deviation of $2. 


What are the z-values for EPS of $2 and $8? 

Answer: 
If EPS = x = $8, then z = (x - μ) / σ = ($8 - $6) / $2 = +1
If EPS = x = $2, then z = (x - μ) / σ = ($2 - $6) / $2 = -2


Here, z = +1 indicates that an EPS of $8 is one standard deviation above the mean, 
and z = -2 means that an EPS of $2 is two standard deviations below the mean. 
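As a quick illustration, the standardization step can be written as a one-line function. The name `z_value` is an assumption made for this sketch.

```python
def z_value(x: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


# EPS example: population mean $6, standard deviation $2
print(z_value(8, 6, 2))  # 1.0
print(z_value(2, 6, 2))  # -2.0
```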


Calculating Probabilities Using z-Values 


Now we will show how to use standardized values (z-values) and a table of
probabilities for Z to determine probabilities. A portion of a table of the CDF for a
standard normal distribution is shown in Figure 14.4. We will refer to this table as the
z-table, as it contains values generated using the cumulative distribution function for a
standard normal distribution, denoted by F(Z). Thus, the values in the z-table are the
probabilities of observing a z-value that is less than a given value, z [i.e., P(Z < z)]. The
numbers in the first column are z-values that have only one decimal place. The columns
to the right supply probabilities for z-values with two decimal places.


Note that the z-table in Figure 14.4 only provides probabilities for positive z-values.
This is not a problem because we know from the symmetry of the standard normal
distribution that F(-Z) = 1 - F(Z). The tables in the back of many texts provide
probabilities for negative z-values, but we will work with only the positive portion of
the table because this may be all you get on the exam. In Figure 14.4, we can find the
probability that a standard normal random variable will be less than 1.66, for example.
The table value is 95.15%. The probability that the random variable will be less than
-1.66 is simply 1 - 0.9515 = 0.0485 = 4.85%, which is also the probability that the
variable will be greater than +1.66.


Figure 14.4: Cumulative Probabilities for a Standard Normal Distribution 


CDF Values for the Standard Normal Distribution: The z-Table 

z      .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
0.0  .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
0.1  .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
0.2  .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
0.5  .6915  Please note that several of the rows have been deleted to save space.*
1.2  .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
1.6  .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
1.8  .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
1.9  .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
2.0  .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
2.5  .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
3.0  .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990


*A complete cumulative Standard normal table is included in the Appendix. 


PROFESSOR'S NOTE

When you use the standard normal probabilities, you have formulated the
problem in terms of standard deviations from the mean. Consider a security 
with returns that are approximately normal, an expected return of 10%, and 
standard deviation of returns of 12%. The probability of returns greater than 
30% is calculated based on the number of standard deviations that 30% is 
above the expected return of 10%. In this case, 30% is 20% above the 
expected return of 10%, which is 20 / 12 = 1.67 standard deviations above the 
mean. We look up the probability of returns less than 1.67 standard 
deviations above the mean (0.9525 or 95.25% from Figure 14.4) and calculate 
the probability of returns more than 1.67 standard deviations above the mean 
as 1 - 0.9525 = 4.75%. 
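The z-table lookups described above can be reproduced numerically. The sketch below assumes the standard identity F(z) = 0.5[1 + erf(z / √2)] for the standard normal CDF; the function name `std_normal_cdf` is illustrative.

```python
import math


def std_normal_cdf(z: float) -> float:
    """F(z) = P(Z < z) for a standard normal random variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


print(f"{std_normal_cdf(1.66):.4f}")      # 0.9515, the table value for z = 1.66
print(f"{std_normal_cdf(-1.66):.4f}")     # 0.0485, by symmetry 1 - F(1.66)
print(f"{1 - std_normal_cdf(1.67):.4f}")  # 0.0475, P(return > 30%) in the note
```

Because the function accepts negative arguments directly, the symmetry rule F(-Z) = 1 - F(Z) is only needed when working from a table of positive z-values.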


EXAMPLE: Using the z-table (1) 
Considering again EPS distributed with μ = $6 and σ = $2, what is the probability
that EPS will be $9.70 or more? 
Answer: 
Here we want to know P (EPS > $9.70), which is the area under the curve to the right 
of the z-value corresponding to EPS = $9.70 (see the distribution that follows). 
The z-value for EPS = $9.70 is: 
$$z = \frac{(x - \mu)}{\sigma} = \frac{(9.70 - 6)}{2} = 1.85$$


That is, $9.70 is 1.85 standard deviations above the mean EPS value of $6. 


From the z-table, we have F(1.85) = 0.9678, but this is P(EPS < 9.70). We want P(EPS 
> 9.70), which is 1 - P(EPS < 9.70). 


P(EPS > 9.70) = 1 - 0.9678 = 0.0322, or 3.2% 
[Figure: P(EPS > $9.70). The right-tail area of 0.0322 lies beyond EPS = $9.70 (z-value 1.85); the mean is EPS = $6.00 (z-value 0).]


EXAMPLE: Using the z-table (2) 

Using the distribution of EPS with μ = $6 and σ = $2 again, what percent of the
observed EPS values are likely to be less than $4.10? 

Answer: 


As shown graphically in the distribution that follows, we want to know P(EPS < 
$4.10). This requires a two-step approach like the one taken in the preceding 
example. 


First, the corresponding z-value must be determined as follows: 


$$z = \frac{(\$4.10 - \$6)}{\$2} = -0.95$$


So, $4.10 is 0.95 standard deviations below the mean of $6.00. 


Now, from the z-table for negative values in the back of this book, we find that 
F(-0.95) = 0.1711, or 17.11%. 


Finding a Left-Tail Probability


[Figure: left tail of the EPS distribution. EPS: $4.10, $6.00; z-values: -0.95, 0, +0.95.]

The z-table gives us the probability that the outcome will be more than 0.95 
standard deviations below the mean. 
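Both EPS examples can be verified without a table. This sketch uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later); the variable names are illustrative.

```python
from statistics import NormalDist

# EPS is assumed normal with mean $6 and standard deviation $2
eps = NormalDist(mu=6.0, sigma=2.0)

p_above = 1 - eps.cdf(9.70)  # right tail beyond z = 1.85
p_below = eps.cdf(4.10)      # left tail below z = -0.95

print(f"{p_above:.4f}")  # 0.0322
print(f"{p_below:.4f}")  # 0.1711
```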


The Lognormal Distribution 


The lognormal distribution is generated by the function e^x, where x is normally
distributed. Because the natural logarithm, ln, of e^x is x, the logarithms of lognormally
distributed random variables are normally distributed, thus the name.


The PDF for the lognormal distribution is: 


$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln x - \mu)^{2}}{2\sigma^{2}}}, \quad x > 0$$

Figure 14.5 illustrates the differences between a normal distribution and a lognormal 
distribution. 


Figure 14.5: Normal vs. Lognormal Distributions 


In Figure 14.5, we can see the following:

■ The lognormal distribution is skewed to the right.

■ The lognormal distribution is bounded from below by zero so that it is useful for
modeling asset prices that never take negative values.


If we used a normal distribution of returns to model asset prices over time, we would
admit the possibility of returns less than -100%, which would admit the possibility of
asset prices less than zero. Using a lognormal distribution to model price relatives
avoids this problem. A price relative is just the end-of-period price of the asset divided
by the beginning price (S₁/S₀) and is equal to (1 + the holding period return). To get the
end-of-period asset price, we can simply multiply the price relative by the
beginning-of-period asset price. Because a lognormal distribution takes a minimum value of zero,
end-of-period asset prices cannot be less than zero. A price relative of zero corresponds
to a holding period return of -100% (i.e., the asset price has gone to zero).
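A small simulation illustrates why lognormal price relatives keep prices positive. This is only a sketch: the starting price, the mean and volatility of the log return, the seed, and the sample size are all hypothetical values chosen for illustration.

```python
import math
import random

random.seed(42)  # arbitrary seed so the sketch is reproducible

s0 = 100.0              # hypothetical beginning-of-period price
mu, sigma = 0.05, 0.60  # hypothetical mean and volatility of the log return

# ln(S1/S0) is drawn from a normal distribution, so S1/S0 = e^x is lognormal.
price_relatives = [math.exp(random.gauss(mu, sigma)) for _ in range(10_000)]

# End-of-period price = price relative * beginning-of-period price
end_prices = [s0 * pr for pr in price_relatives]

print(min(end_prices) > 0)  # True: e^x is positive for every real x
```

Even with a deliberately large volatility, no simulated price falls to or below zero, because the exponential function never returns a negative value.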


MODULE QUIZ 14.2
1. The probability that a normal random variable will be more than two standard
deviations above its mean is: 
A. 0.0217. 
B. 0.0228. 
C. 0.4772. 
D. 0.9772. 


2. Which of the following random variables is least likely to be modeled appropriately by 
a lognormal distribution? 


A. The size of silver particles in a photographic solution. 

B. The number of hours a housefly will live. 

C. The return on a financial security.

D. The weight of a meteor entering the earth's atmosphere. 


MODULE 14.3: STUDENT'S T, CHI-SQUARED, AND F- 
DISTRIBUTIONS 


Student's t-Distribution 


Student’s t-distribution is similar to a normal distribution, but has fatter tails (i.e., a
greater proportion of the outcomes are in the tails of the distribution). It is the 
appropriate distribution to use when constructing confidence intervals based on small 
samples (n < 30) from a population with unknown variance and a normal, or 
approximately normal, distribution. It may also be appropriate to use the t-distribution 
when the population variance is unknown and the sample size is large enough that the 
central limit theorem will assure that the sampling distribution is approximately 
normal. 


Student’s t-distribution has the following properties:


■ It is symmetrical.

■ It is defined by a single parameter, the degrees of freedom (df), where the degrees of
freedom are equal to the number of sample observations minus 1, n - 1, for sample
means.

■ It has a greater probability in the tails (fatter tails) than the normal distribution.

■ As the degrees of freedom (the sample size) get larger, the shape of the t-distribution
more closely approaches a standard normal distribution.


The degrees of freedom for tests based on sample means are n - 1 because, given the 
mean, only n - 1 observations can be unique. 


The table in Figure 14.6 contains one-tailed critical values for the t-distribution at the 
0.05 and 0.025 levels of significance with various degrees of freedom (df). Note that, 
unlike the z-table, the t-values are contained within the table and the probabilities are 
located at the column headings. 


Figure 14.6: Table of Critical t-Values 


        One-Tailed Probabilities, p
df      p = 0.05    p = 0.025
5        2.015       2.571
10       1.812       2.228
15       1.753       2.131
20       1.725       2.086
40       1.684       2.021
60       1.671       2.000
80       1.664       1.990
100      1.660       1.984
120      1.658       1.980
∞        1.645       1.960


Figure 14.7 illustrates the shapes of the t-distribution associated with different degrees 
of freedom. The tendency is for the t-distribution to look more and more like the 
normal distribution as the degrees of freedom increase. Practically speaking, the 
greater the degrees of freedom, the greater the percentage of observations near the 
center of the distribution and the lower the percentage of observations in the tails, 
which are thinner as degrees of freedom increase. This means that confidence intervals 
for a random variable that follows a t-distribution must be wider than those for a
normal distribution, for a given confidence level. 
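To make the "wider interval" point concrete, the sketch below compares a few of the one-tailed p = 0.025 critical values from Figure 14.6 with the normal value of 1.960. Only the table values come from the reading; the dictionary name and the percentage comparison are illustrative choices.

```python
# One-tailed p = 0.025 critical t-values copied from Figure 14.6
t_critical = {5: 2.571, 10: 2.228, 20: 2.086, 60: 2.000, 120: 1.980}
z_critical = 1.960  # normal distribution (df -> infinity)

for df, t in t_critical.items():
    # A 95% interval is "estimate +/- critical value * standard error",
    # so its width scales with the critical value.
    extra_width = t / z_critical - 1
    print(f"df = {df:>3}: {extra_width:.1%} wider than the normal-based interval")
```

The extra width shrinks steadily as the degrees of freedom increase, which is the convergence toward the normal distribution described above.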


Figure 14.7: t-Distributions for Different Degrees of Freedom (df) 


The Chi-Squared Distribution


Hypothesis tests concerning population parameters and models of random variables 
that are always positive are often based on a chi-squared distribution, denoted χ². The
chi-squared distribution is asymmetrical, bounded below by zero, and approaches the 
normal distribution in shape as the degrees of freedom increase. 


Figure 14.8: Chi-Squared Distribution 


The chi-squared test statistic, χ², with n - 1 degrees of freedom, is computed as:


$$\chi^{2}_{n-1} = \frac{(n-1)s^{2}}{\sigma_{0}^{2}}$$

where:

n = sample size
s² = sample variance
σ₀² = hypothesized value for the population variance


The chi-squared test compares the test statistic to a critical chi-squared value at a 
given level of significance to determine whether to reject or fail to reject a null 
hypothesis. 
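The test statistic itself is a one-line calculation. In the sketch below, the sample size and both variances are hypothetical inputs, and the function name is an assumption; the critical value would still need to be read from a chi-squared table.

```python
def chi_squared_statistic(n: int, sample_var: float, hypothesized_var: float) -> float:
    """(n - 1) * s^2 / sigma_0^2, with n - 1 degrees of freedom."""
    return (n - 1) * sample_var / hypothesized_var


# Hypothetical inputs: 25 observations, sample variance 25 (std. dev. 5%),
# hypothesized population variance 16 (std. dev. 4%), both in percent squared
stat = chi_squared_statistic(25, 25.0, 16.0)
print(stat)  # 37.5, to be compared with a critical value for df = 24
```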


The F-Distribution 

Hypotheses concerning the equality of the variances of two populations are tested with
an F-distributed test statistic. An F-distributed test statistic is used when the
populations from which samples are drawn are normally distributed and the
samples are independent.


The test statistic for the F-test is the ratio of the sample variances. The F-statistic is
computed as:


$$F = \frac{s_{1}^{2}}{s_{2}^{2}}$$

where:

s₁² = variance of the sample of n₁ observations drawn from Population 1

s₂² = variance of the sample of n₂ observations drawn from Population 2


An F-distribution is presented in Figure 14.9. As indicated, the F-distribution is
right-skewed and is truncated at zero on the left-hand side. The shape of the F-distribution is
determined by two separate degrees of freedom, the numerator degrees of freedom, df₁,
and the denominator degrees of freedom, df₂.


Note that n₁ - 1 and n₂ - 1 are the degrees of freedom used to identify the appropriate
critical value from the F-table (provided in the Appendix).
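As an illustration of the ratio and its two degrees of freedom, the sketch below computes an F-statistic from two small made-up samples. The data are hypothetical; `statistics.variance` returns the sample variance (divisor n - 1).

```python
from statistics import variance

# Hypothetical samples drawn from Population 1 and Population 2
sample_1 = [2.0, 4.0, 6.0, 8.0]
sample_2 = [4.0, 5.0, 6.0, 7.0]

s1_sq = variance(sample_1)  # sample variance of sample 1
s2_sq = variance(sample_2)  # sample variance of sample 2

f_stat = s1_sq / s2_sq
df_num = len(sample_1) - 1  # numerator degrees of freedom, n1 - 1
df_den = len(sample_2) - 1  # denominator degrees of freedom, n2 - 1

print(round(f_stat, 2), df_num, df_den)  # 4.0 3 3
```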


Some additional properties of the F-distribution include the following: 


■ The F-distribution approaches the normal distribution as the number of observations
increases (just as with the t-distribution and chi-squared distribution).


■ A random variable’s t-value squared (t²) with n - 1 degrees of freedom is
F-distributed with 1 degree of freedom in the numerator and n - 1 degrees of freedom
in the denominator.


■ There exists a relationship between the F- and chi-squared distributions such that:


$$F = \frac{\chi^{2}}{\text{\# of observations in numerator}}$$

as the # of observations in denominator → ∞


Figure 14.9: F-Distribution 


numerator df₁ = 10, denominator df₂ = 10


The Exponential Distribution 


The exponential distribution is often used to model waiting times, such as how long it 
takes an employee to serve a customer or the time it takes a company to default. The 
PDF for this distribution is as follows: 


$$f(x) = \frac{1}{\beta}\, e^{-x/\beta}, \quad x \geq 0$$


In the previous function, the scale parameter, β, is greater than zero and is the
reciprocal of the rate parameter λ (i.e., λ = 1/β). The rate parameter measures the rate
at which an event occurs. In the context of waiting for a company to
default, the rate parameter is known as the hazard rate and indicates the rate at which
default will arrive.


Figure 14.10 displays the PDF of the exponential distribution assuming different values 
for the rate parameter. 


Figure 14.10: Exponential PDF 




The exponential distribution is able to assess the time it takes a company to default.
However, what if we want to evaluate the total number of defaults over a specific
period? As it turns out, the number of defaults up to a certain time t, N_t, follows a
Poisson distribution with a rate parameter equal to t / β.


We can further examine the relationship between the exponential and Poisson
distributions by considering the mean and variance of both distributions. Recall that
the mean and variance of a Poisson-distributed random variable are both equal to λ. As it
turns out, the mean of the exponential distribution is equal to 1 / λ, and the variance is
equal to 1 / λ².
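These relationships can be sketched numerically. The scale parameter of 4 years and the 2-year horizon below are hypothetical values, and the function name `exponential_pdf` is an assumption made for illustration.

```python
import math


def exponential_pdf(x: float, beta: float) -> float:
    """f(x) = (1 / beta) * exp(-x / beta) for x >= 0, and 0 otherwise."""
    return math.exp(-x / beta) / beta if x >= 0 else 0.0


beta = 4.0      # hypothetical scale parameter (e.g., years)
lam = 1 / beta  # rate parameter (hazard rate) = 0.25 per year

mean_wait = 1 / lam      # mean of the exponential = 1 / lambda = beta
var_wait = 1 / lam ** 2  # variance of the exponential = 1 / lambda^2

t = 2.0                  # hypothetical horizon
poisson_rate = t / beta  # rate parameter for the number of defaults up to t

print(mean_wait, var_wait, poisson_rate)  # 4.0 16.0 0.5
print(exponential_pdf(0.0, beta))         # 0.25, the density at x = 0 is 1 / beta
```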


The Beta Distribution 


The beta distribution can be used for modeling default probabilities and recovery 
rates. As a result, it is used in some credit risk models such as CreditMetrics®, which 
will be discussed in the FRM Part II curriculum. The mass of the beta distribution is
located between zero and one. As you can see from Figure 14.11, this
distribution can be symmetric or skewed depending on the values of its shape
parameters (β and α).


Figure 14.11: Beta PDF 


Mixture Distributions 


LO 14.b: Describe a mixture distribution and explain the creation and 
characteristics of mixture distributions. 


The distributions discussed in this reading, as well as other distributions, can be 
combined to create unique PDFs. It may be helpful to create a new distribution if the 
underlying data you are working with does not currently fit a predetermined 
distribution. In this case, a newly created distribution may assist with explaining the 
relevant data. 


To illustrate a mixture distribution, suppose that the returns of a stock follow a normal 
distribution with low volatility 75% of the time and high volatility 25% of the time. 
Here, we have two normal distributions with the same mean, but different risk levels. 
To create a mixture distribution from these scenarios, we randomly choose either the 
low or high volatility distribution, placing a 75% probability on selecting the low 
volatility distribution. We then generate a random return from the selected 
distribution. By repeating this process several times, we will create a probability 
distribution that reflects both levels of volatility. 
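The sampling procedure described above can be sketched as a short simulation. The 75%/25% weights come from the example; the two volatility levels (1% and 3%), the common mean of zero, the seed, and the sample size are hypothetical choices made for illustration.

```python
import random

random.seed(0)  # arbitrary seed so the sketch is reproducible


def draw_return(low_vol=0.01, high_vol=0.03, p_low=0.75, mean=0.0):
    """Pick a volatility regime at random, then draw a normal return from it."""
    sigma = low_vol if random.random() < p_low else high_vol
    return random.gauss(mean, sigma)


returns = [draw_return() for _ in range(100_000)]

n = len(returns)
m = sum(returns) / n
var = sum((r - m) ** 2 for r in returns) / n
kurtosis = sum((r - m) ** 4 for r in returns) / n / var ** 2

print(kurtosis > 3)  # True: the mixture has fatter tails than a single normal
```

Each component is normal with kurtosis of 3, yet mixing two different variances produces a distribution with kurtosis above 3.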


Mixture distributions contain elements of both parametric and nonparametric 
distributions. The distributions used as inputs (i.e., the component distributions) are 
parametric, while the weights of each distribution within the mixture are 
nonparametric. The more component distributions used as inputs, the more closely the 
mixture distribution will follow the actual data. However, more component 
distributions will make it difficult to draw conclusions given that the newly created 
distribution will be very specific to the data. 


By mixing distributions, it is easy to see how we can alter skewness and kurtosis of the
component distributions. Skewness can be changed by combining distributions with
different means, and kurtosis can be changed by combining distributions with different
variances. Also, by combining distributions that have significantly different means, we
can create a mixture distribution with multiple modes (e.g., a bimodal distribution).


Creating a more robust distribution is clearly beneficial to risk managers. Different 
levels of skew and/or kurtosis can reveal extreme events that were previously difficult 
to identify. By creating these mixture distributions, we can improve risk models by 
incorporating the potential for low-frequency, high-severity events. 


MODULE QUIZ 14.3
1. The t-distribution is the appropriate distribution to use when constructing
confidence intervals based on:
A. large samples from populations with known variance that are nonnormal.
B. large samples from populations with known variance that are at least
approximately normal.
C. small samples from populations with known variance that are at least
approximately normal.
D. small samples from populations with unknown variance that are at least
approximately normal.
2. Which of the following statements about F- and chi-squared distributions is least
accurate? Both distributions:
A. are asymmetrical.
B. are bound by zero on the left.
C. are defined by degrees of freedom.
D. have means that are less than their standard deviations.


KEY CONCEPTS 


LO 14.a 

A continuous uniform distribution is one where the probability of X occurring in a
possible range is the length of the range relative to the total of all possible values.
Letting a and b be the lower and upper limit of the uniform distribution, respectively,
then for a ≤ x₁ ≤ x₂ ≤ b,


$$P(x_{1} \leq X \leq x_{2}) = \frac{(x_{2} - x_{1})}{(b - a)}$$


The binomial distribution is a discrete probability distribution for a random variable, X, 
that has one of two possible outcomes: success or failure. The probability of a specific 
number of successes in n independent binomial trials is: 
$$p(x) = P(X = x) = \frac{n!}{(n-x)!\,x!}\, p^{x}(1-p)^{n-x}$$

where p = probability of success in a given trial 
The Poisson random variable, X, refers to a specific number of successes per unit. The
probability for obtaining x successes, given a Poisson distribution with parameter λ, is:


$$P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$$


The normal probability distribution has the following characteristics: 


■ The normal curve is symmetrical and bell-shaped with a single peak at the exact
center of the distribution.


■ Mean = median = mode, and all are in the exact center of the distribution.


■ The normal distribution can be completely defined by its mean and standard
deviation because the skew is always 0 and kurtosis is always 3.


A standard normal distribution is a normal distribution with a mean of 0 and a
standard deviation of 1. A normal random variable, x, can be normalized (changed to a
standard normal, z) with the transformation z = (x - mean of x) / standard deviation of
x.


A lognormal distribution exists for random variable Y, when Y = e^X and X is normally
distributed. 


The t-distribution is similar, but not identical, to the normal distribution in shape—it is 
defined by the degrees of freedom and has fatter tails. The t-distribution is used to 
construct confidence intervals for the population mean when the population variance 
is not known. 


Degrees of freedom for the t-distribution is equal to n — 1; Student’s t-distribution is 
closer to the normal distribution when df is greater, and confidence intervals are 
narrower when df is greater. 


The chi-squared distribution is asymmetrical, bounded below by zero, and approaches 
the normal distribution in shape as the degrees of freedom increase. 


The F-distribution is right-skewed and is truncated at zero on the left-hand side. The 
shape of the F-distribution is determined by two separate degrees of freedom. 


LO 14.b 


Mixture distributions combine the concepts of parametric and nonparametric 
distributions. The component distributions used as inputs are parametric while the 
weights of each distribution within the mixture are based on historical data, which is 
nonparametric. 


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 14.1 

1.A The probability of a defective car (p) is 0.05; hence, the probability of a
nondefective car (q) = 1 - 0.05 = 0.95. Assuming a Poisson distribution:
λ = np = (3)(0.05) = 0.15


Then,


$$P(X = 1) = \frac{\lambda^{x} e^{-\lambda}}{x!} = \frac{(0.15)^{1} e^{-0.15}}{1!} = 0.129106$$

(LO 14.a) 


2.C Success = having a web page: 
$$\frac{6!}{4!\,(6-4)!}\,(0.6)^{4}(0.4)^{6-4} = 15(0.1296)(0.16) = 0.311$$
(LO 14.a) 

3.B Since a = 12 and b = 28:

$$P(15 \leq X \leq 25) = \frac{(25 - 15)}{(28 - 12)} = \frac{10}{16} = 0.625$$


(LO 14.a) 
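Answers 2 and 3 can be double-checked with the binomial and continuous uniform formulas from the Key Concepts. This is a minimal sketch; the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) successes in n independent trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def uniform_prob(x1: float, x2: float, a: float, b: float) -> float:
    """P(x1 <= X <= x2) for a continuous uniform variable on [a, b]."""
    return (x2 - x1) / (b - a)


print(f"{binomial_pmf(4, 6, 0.6):.3f}")       # 0.311 (Question 2)
print(f"{uniform_prob(15, 25, 12, 28):.3f}")  # 0.625 (Question 3)
```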


Module Quiz 14.2 
1.B 1- F(2) = 1 - 0.9772 = 0.0228 
(LO 14.a) 
2.C A lognormally distributed random variable cannot take on values less than zero.
The return on a financial security can be negative. The other choices refer to
variables that cannot be less than zero. (LO 14.a)


Module Quiz 14.3 


1.D The t-distribution is the appropriate distribution to use when constructing 
confidence intervals based on small samples from populations with unknown 
variance that are either normal or approximately normal. (LO 14.a) 


2.D There is no consistent relationship between the mean and standard deviation of
the chi-squared or F-distributions. (LO 14.a) 


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 4. 


READING 15 
MULTIVARIATE RANDOM VARIABLES 


Study Session 4 


EXAM FOCUS 


This reading covers the dependency of multivariate random variables. For the exam, be 
prepared to explain and calculate the mean and variance for bivariate random 
variables. The dependency between the components is important, and you should 
understand the calculation for covariance and correlation. The marginal and 
conditional distributions are used to transform bivariate distributions and provide 
additional insights for finance and risk management. Be able to use these distributions 
to compute a conditional expectation and conditional moments that summarize the 
conditional distribution of a random variable. 


MODULE 15.1: MARGINAL AND CONDITIONAL 
DISTRIBUTIONS FOR BIVARIATE DISTRIBUTIONS 


Probability Matrices 


LO 15.a: Explain how a probability matrix can be used to express a probability 
mass function. 


A random variable is an uncertain quantity or number. Multivariate random
variables are vectors of random variables, where a vector of dimension n contains n random
variables. Thus, the study of multivariate random variables includes measurements of
dependency between two or more random variables. In this reading, we will examine
bivariate random variables or components, which are a special case of the n-dimension
multinomial distribution. The bivariate random variable X is a vector with two
components: X₁ and X₂.


A probability mass function (PMF) for a bivariate random variable describes the 
probability that two random variables each take a specific value. The PMF of a 
bivariate random variable is: 


$$f_{X_{1},X_{2}}(x_{1}, x_{2}) = P(X_{1} = x_{1}, X_{2} = x_{2})$$


A probability matrix illustrates the following properties of a PMF: 


■ The probability matrix describes the outcome probabilities as a function of the
coordinates x₁ and x₂.


■ All probabilities are positive or zero and are less than or equal to 1.


■ The sum across all possible outcomes for X₁ and X₂ equals 1.


A probability matrix is used to describe the relationship between discrete distributions 
defined over a finite set of values. The most common application of a discrete bivariate 
random variable is the trinomial distribution. In this type of example, there are n 
independent trials, and each trial has one of three discrete possible outcomes. The 
trinomial distribution has three parameters: the number of trials (n), the probability of
observing outcome 1 (p₁), and the probability of observing outcome 2 (p₂). The sum of
all probabilities of each outcome occurring must always equal 100%. Therefore, the
probability of the third outcome occurring is found by subtracting p₁ and p₂ from 1 as
follows:


p₃ = 1 - p₁ - p₂


A probability matrix can be created that summarizes the probability of each outcome 
occurring. 


EXAMPLE: Applying a probability matrix 


Suppose that a company’s common stock return is related to earnings 
announcements. Earnings announcements are either positive, neutral, or negative 
and are labeled as 1, 0, and -1, respectively. Assume that the company’s monthly 
stock return must be one of three possible outcomes, -3%, 0%, or 3%. An analyst 
estimates the probability matrix in Figure 15.1 for earnings announcements and 
stock returns. Compute the probability of a negative earnings announcement. 


Figure 15.1: Probability Matrix for Bivariate Random Variables 


                              Stock Return (X₁)
Earnings Announcement (X₂)    -3%     0%     3%
Negative   -1                 25%    15%     0%
Neutral     0                  5%    10%    15%
Positive    1                  0%     5%    25%

Answer: 


The sum of all probabilities in the first row of the probability matrix states that 
there is a 40% probability of a negative announcement. Also, there is a 25% 
probability of a negative announcement and a -3% return, a 15% probability of a 
negative announcement and a 0% return, and a 0% probability of a negative 
announcement and a 3% return. 


Marginal and Conditional Distributions 


LO 15.b: Compute the marginal and conditional distributions of a discrete 
bivariate random variable. 


A marginal distribution defines the distribution of a single component of a bivariate 
random variable (i.e., a univariate random variable). Thus, the notation for the marginal 
PMF is the same notation for a univariate random variable: 


$$f_{X_{1}}(x_{1}) = \sum_{x_{2}} f_{X_{1},X_{2}}(x_{1}, x_{2})$$


The computation of a marginal distribution can be shown using the previous example 
of earnings announcements and monthly stock returns. Summing across columns 
constructs the marginal distribution of the row variables in a probability matrix. 
Summing across rows constructs the marginal distribution for the column variables in 
a probability matrix. 


EXAMPLE: Marginal distributions 


Using the probability matrix in Figure 15.1, compute the marginal PMF for the 3% 
monthly stock return as well as the marginal PMF for a 0% and -3% monthly stock 
return. 


Answer: 


The marginal PMF for a 3% monthly stock return, X₁, is calculated by summing the
probabilities of all outcomes of 3% across all values based on the earnings
announcements, X₂.


$$f_{X_{1}}(3\%) = \sum_{x_{2} \in \{-1,0,1\}} f(3\%, x_{2}) = 0\% + 15\% + 25\% = 40\%$$

The marginal PMFs for a 0% and -3% monthly stock return are as follows:


$$f_{X_{1}}(0\%) = \sum_{x_{2} \in \{-1,0,1\}} f(0\%, x_{2}) = 15\% + 10\% + 5\% = 30\%$$


$$f_{X_{1}}(-3\%) = \sum_{x_{2} \in \{-1,0,1\}} f(-3\%, x_{2}) = 25\% + 5\% + 0\% = 30\%$$

Thus, the complete marginal PMF for monthly stock returns, X₁, is the following.

Return         -3%     0%     3%
Probability    30%    30%    40%


Figure 15.2 shows the column and row sums of Figure 15.1, labeled
as the marginal PMFs for X₁ and X₂. Note that the sum of all possible outcomes for the
monthly stock return, X₁, equals 1 at the bottom of Figure 15.2 (30% + 30% + 40% =
100%). Similarly, the sum of all possible outcomes for the earnings announcements at
the right of Figure 15.2 must also equal 1 (40% + 30% + 30% = 100%).


Figure 15.2: Marginal PMFs of X1 and X2


                          Stock Return (X1)
Earnings (X2)         -3%     0%      3%      f_X2(x2)
Negative   -1         25%     15%     0%      40%
Neutral     0         5%      10%     15%     30%
Positive    1         0%      5%      25%     30%

f_X1(x1)              30%     30%     40%


A conditional distribution sums the probabilities of the outcomes for each 
component conditional on the other component being a specific value. A conditional 
PMF is defined based on the conditional probability for a bivariate random variable X1
given X2 as:


$$f_{X_1|X_2}(x_1 \mid X_2 = x_2) = \frac{f_{X_1,X_2}(x_1, x_2)}{f_{X_2}(x_2)}$$

The numerator in this equation is the joint probability of two events occurring, and the
denominator is the marginal probability that X2 = x2. Continuing with the previous
example, we can determine the three possible outcomes of monthly stock returns given
a negative earnings announcement (x2 = -1). When there is a negative earnings
announcement, the three probabilities in the first row of the bivariate probability
matrix illustrated in Figure 15.2 are 25%, 15%, and 0% for monthly returns of -3%, 0%,
and 3%, respectively. These joint probabilities are then divided by the marginal
probability of a negative earnings announcement, which is shown in the upper
right-hand corner of Figure 15.2 as 40%.


Thus, the conditional PMF for X2 = -1 is summarized as follows:

Return         -3%                  0%                   3%
Probability    25% / 40% = 62.5%    15% / 40% = 37.5%    0% / 40% = 0.0%
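
The marginal and conditional calculations above can be sketched in a few lines of Python. This is only an illustration: the probabilities are copied from Figure 15.2, while the dictionary layout and the function names (marginal, conditional_x1_given_x2) are assumptions made for this sketch.

```python
# Joint PMF from Figure 15.2, keyed by (x2, x1):
# x2 is the earnings announcement (-1, 0, 1), x1 the monthly return in percent.
JOINT = {
    (-1, -3): 0.25, (-1, 0): 0.15, (-1, 3): 0.00,
    (0, -3): 0.05, (0, 0): 0.10, (0, 3): 0.15,
    (1, -3): 0.00, (1, 0): 0.05, (1, 3): 0.25,
}


def marginal(joint, axis):
    """Sum the joint PMF over the other component (axis 0 -> X2, axis 1 -> X1)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def conditional_x1_given_x2(joint, x2):
    """Divide each joint probability in the x2 row by the marginal of x2."""
    denom = marginal(joint, 0)[x2]
    return {x1: p / denom for (k2, x1), p in joint.items() if k2 == x2}


f_x1 = marginal(JOINT, 1)                 # marginal PMF of the return
f_x2 = marginal(JOINT, 0)                 # marginal PMF of the announcement
cond = conditional_x1_given_x2(JOINT, -1)  # return PMF given negative news
print(f_x1, f_x2, cond)
```

Each conditional PMF sums to 1 because the row is rescaled by its own total.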


MODULE QUIZ 15.1


Use the following information to answer Questions 1 and 2. 


Suppose a hedge fund manager expects a stock to have three possible returns (-6%, 0%, 
6%) following negative, neutral, or positive changes in analyst ratings, respectively. The 
fund manager constructs the following bivariate probability matrix for the stock. 


                          Stock Return (X1)
Ratings (X2)          -6%     0%      6%
Negative   -1         30%     15%     0%
Neutral     0         10%     10%     5%
Positive    1         0%      5%      25%


1. What is the marginal probability that the stock has a positive analyst rating? 


A. 10%. 
B. 15%. 
C. 25%. 
D. 30%. 
2. What are the conditional probabilities of the three monthly stock returns given that 
the analyst rating is positive? 


A. 0.0% 16.7% 83.3% 
B. 66.7% 33.3% 0.0% 
C. 40.0% 40.0% 0.0% 
D. 15.0% 10.0% 5.0% 


MODULE 15.2: MOMENTS OF BIVARIATE RANDOM 
DISTRIBUTIONS 


Expectation of a Bivariate Random Function 


LO 15.c: Explain how the expectation of a function is computed for a bivariate 
discrete random variable. 


The first moment of a bivariate discrete random variable is referred to as an
expectation of a function. The expectation of a bivariate random function g(X1, X2) is a
probability-weighted average of the function of the outcomes g(x1, x2) and is expressed
as follows:


$$E[g(X_1, X_2)] = \sum_{x_1 \in R(X_1)} \sum_{x_2 \in R(X_2)} g(x_1, x_2)\, f_{X_1,X_2}(x_1, x_2)$$


The function g(x1, x2) depends on x1 and x2 but may only be a function of one of the
components.


EXAMPLE: Computing expectation of a bivariate random function 


Compute the expectation of the function g(x1, x2) = x1^x2 using the joint PMF presented
in Figure 15.3.


Figure 15.3: Joint PMF for x1 and x2


Answer: 


The expectation is computed as follows: 


$$E[g(X_1, X_2)] = \sum_{x_1 \in R(X_1)} \sum_{x_2 \in R(X_2)} g(x_1, x_2)\, f_{X_1,X_2}(x_1, x_2)$$


$$= 2^3(0.25) + 2^5(0.45) + 4^3(0.05) + 4^5(0.25)$$


$$= 2.0 + 14.4 + 3.2 + 256.0 = 275.6$$
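
As a rough check of this example, the probability-weighted sum can be written directly in Python. The outcome pairs and probabilities below are inferred from the terms of the worked computation (each term read as x1 raised to the power x2), since the figure itself is not reproduced here; the names joint_pmf and expectation are assumptions for this sketch.

```python
# Joint PMF inferred from the worked computation: keys are (x1, x2).
joint_pmf = {(2, 3): 0.25, (2, 5): 0.45, (4, 3): 0.05, (4, 5): 0.25}


def expectation(g, pmf):
    """Probability-weighted average of g(x1, x2) over every outcome pair."""
    return sum(g(x1, x2) * p for (x1, x2), p in pmf.items())


# g(x1, x2) = x1 ** x2, as in the example.
e_g = expectation(lambda x1, x2: x1 ** x2, joint_pmf)
print(round(e_g, 1))
```

Passing a different g, such as one that uses only x1, gives the expectation of a function of a single component.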


Covariance and Correlation Between Random Variables 


LO 15.d: Define covariance and explain what it measures. 


Expectations of bivariate random variables are used to describe relationships in the
same way that they are used to define moments for univariate random variables. For
example, the expected return for a stock is used to define the variance of the stock in
the univariate case. The first moment of X = [X1, X2] is the expected mean of the
components, E[X]. The second moment of a bivariate random variable X has two components
and is calculated as a covariance.


Covariance is the expected value of the product of the deviations of the two random
variables from their respective expected values. Common notations for the covariance
between random variables X and Y are Cov(X,Y) and σ_XY. Covariance measures how two
variables move with each other or the dependency between the two variables. The
covariance of a bivariate random variable X is a 2-by-2 matrix, where the values
along one diagonal are the variances of X1 and X2. The values along the other diagonal
are the covariance between X1 and X2. For bivariate random variables, there are two
variances and one covariance.


The calculations for variances for bivariate random numbers are analogous to the 
calculation of dispersion for univariate numbers, where the distance of observations 
from the expected mean is squared as follows: 


$$\text{Var}[X_1] = E[(X_1 - E[X_1])^2]$$

The covariance between X1 and X2 is calculated as:
$$\text{Cov}[X_1, X_2] = E[(X_1 - E[X_1])(X_2 - E[X_2])]$$
$$\text{Cov}[X_1, X_2] = E[X_1 X_2] - E[X_1]E[X_2]$$


The variances and covariance of two components of X are expressed in a 2-by-2 matrix 
of X as: 


$$\text{Cov}[X] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$$


To aid in the definition of covariance, consider the returns of a stock and of a put option 
on the stock. These two returns will have a negative covariance because they move in 
opposite directions. The returns of two automotive stocks would likely have a positive 
covariance, and the returns of a stock and a riskless asset would have a zero covariance 


because the riskless asset’s returns never move, regardless of movements in the stock’s 
return. 
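
To make the two covariance formulas concrete, the sketch below applies both to the earnings and return matrix from Figure 15.2. It is an illustration only: returns are measured in percent, the announcement is coded -1, 0, 1, and the helper name expect is an assumption.

```python
# Joint PMF from Figure 15.2, keyed by (x2, x1): announcement, return in percent.
JOINT = {
    (-1, -3): 0.25, (-1, 0): 0.15, (-1, 3): 0.00,
    (0, -3): 0.05, (0, 0): 0.10, (0, 3): 0.15,
    (1, -3): 0.00, (1, 0): 0.05, (1, 3): 0.25,
}


def expect(g):
    """E[g(X1, X2)] under the joint PMF."""
    return sum(g(x1, x2) * p for (x2, x1), p in JOINT.items())


e1 = expect(lambda x1, x2: x1)   # E[X1]
e2 = expect(lambda x1, x2: x2)   # E[X2]

# Definition: expected product of deviations from the means.
cov_definition = expect(lambda x1, x2: (x1 - e1) * (x2 - e2))
# Shortcut: E[X1 X2] - E[X1] E[X2].
cov_shortcut = expect(lambda x1, x2: x1 * x2) - e1 * e2

print(e1, e2, cov_definition, cov_shortcut)
```

Both routes give the same positive covariance, consistent with returns tending to be higher after positive announcements in this matrix.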


LO 15.e: Explain the relationship between the covariance and correlation of two 
random variables and how these are related to the independence of the two 
variables. 


In practice, covariance is difficult to interpret because it depends on the scales of X1
and X2. Thus, it can take on extremely large values, ranging from negative to positive
infinity, and, like variance, these values are expressed in terms of squared units.


To make the covariance of two random variables easier to interpret, it may be divided
by the product of the two random variables’ standard deviations. The resulting
value is called the correlation coefficient, or simply, correlation. The relationship
between covariances, standard deviations, and correlations can be seen in the following
expression for the correlation of the two random variables X1 and X2:


$$\text{Corr}(X_1, X_2) = \frac{\text{Cov}[X_1, X_2]}{\sqrt{\text{Var}[X_1]}\sqrt{\text{Var}[X_2]}} = \frac{\sigma_{12}}{\sqrt{\sigma_1^2}\sqrt{\sigma_2^2}} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$$


Correlation measures the strength of the linear relationship between two variables and
ranges from -1 to +1 for two variables (i.e., -1 ≤ Corr(X1, X2) ≤ +1). Two variables that
are perfectly positively correlated have a correlation coefficient of 1, two variables that
are perfectly negatively correlated have a correlation coefficient of -1, and two
variables that are independent have a correlation of 0 (i.e., no linear relationship).
However, a correlation of 0 does not necessarily imply independence.
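
The last point can be illustrated with a small hypothetical example that is not from the reading: let X1 take the values -1, 0, and 1 with equal probability and let X2 = X1 squared. X2 is completely determined by X1, yet the covariance (and therefore the correlation) is exactly zero. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Hypothetical joint PMF, keyed by (x1, x2), with X2 = X1 ** 2.
third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}


def expect(g):
    """E[g(X1, X2)] under the hypothetical joint PMF."""
    return sum(g(x1, x2) * p for (x1, x2), p in joint.items())


cov = expect(lambda a, b: a * b) - expect(lambda a, b: a) * expect(lambda a, b: b)

# Independence would require P(X1=0, X2=0) = P(X1=0) * P(X2=0).
p_x1_zero = sum(p for (x1, _), p in joint.items() if x1 == 0)
p_x2_zero = sum(p for (_, x2), p in joint.items() if x2 == 0)
p_both_zero = joint.get((0, 0), Fraction(0))
independent = p_both_zero == p_x1_zero * p_x2_zero

print(cov, independent)
```

The covariance is zero because the relationship is symmetric rather than linear, while the independence check fails.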


MODULE QUIZ 15.2
1. What is the expectation of the function g(x1, x2) = x1^x2 using the following joint PMF?


              x1 = 3    x1 = 6
x2 = 2        0.35      0.15
x2 = 4        0.20      0.30


A. 226.4. 
B. 358.9. 
C. 394.7. 
D. 413.6. 


2. A hedge fund manager computed the covariance between the two components of a bivariate
random variable. However, she is having difficulty interpreting the implications of the
dependency between the two variables as the scales of the two variables are very
different. Which of the following statements will most likely benefit the fund
manager when interpreting the dependency for these two random variables?

A. Compute the correlation by multiplying the covariance of the two variables by the 
product of the two variables’ standard deviations. 

B. Disregard the covariance for bivariate random variables as this data is not 
relevant due to the nature of bivariate random variables. 


C. Compute the correlation by dividing the covariance of the two variables by the 
product of the two variables’ standard deviations. 

D. Divide the larger scale variables by a common denominator and rerun the 
estimations of covariance by subtracting each variable's expected mean. 


MODULE 15.3: BEHAVIOR OF MOMENTS FOR 
BIVARIATE RANDOM VARIABLES 


Linear Transformations 


LO 15.f: Explain the effects of applying linear transformations on the covariance 
and correlation between two random variables. 


There are four important effects of a linear transformation on the covariance of 
bivariate random variables. The following example illustrates the effects. 


Suppose there is a linear relationship between X1 and X2 where:
X2 = a + bX1


The first effect of a linear transformation on the covariance of two random variables is
that the sign of b determines the correlation between the components. The correlation
between X1 and X2 will be equal to either


= 1 if b > 0,
= 0 if b = 0, or
= -1 if b < 0.


A second effect of linear transformations on covariance is that the amount or scale of a
has no effect on the variance, and the scale of b changes the variance by a factor of
b². This is true because the variance of the linear relationship is equal to:
$$\text{Var}[a + bX_1] = b^2\,\text{Var}[X_1]$$


A third effect of linear transformations on covariance is that the scale of covariance is 
determined by two variables, b and d, as follows: 


$$\text{Cov}[a + bX_1, c + dX_2] = bd\,\text{Cov}[X_1, X_2]$$


Recall that covariance is defined as the deviation from the expected mean of one 
random variable multiplied by the deviation from the expected mean of the other 
random variable. Therefore, location shifts have no impact on the variance or 
covariance calculations, because only the deviations from the respective means are 
relevant. However, the scale of each component (b and d) contributes multiplicatively 
to the change in covariance. We can also extend the first effect and show that the
correlation is scale free: for X2 = a + bX1, it is either +1 or -1 whenever b is not equal to zero.
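
A quick numerical check of the third effect, again using the Figure 15.2 matrix, is sketched below. The constants a, b, c, and d are arbitrary values chosen for illustration, and cov_corr is a hypothetical helper name.

```python
import math

# Joint PMF from Figure 15.2, keyed by (x2, x1): announcement, return in percent.
JOINT = {
    (-1, -3): 0.25, (-1, 0): 0.15, (-1, 3): 0.00,
    (0, -3): 0.05, (0, 0): 0.10, (0, 3): 0.15,
    (1, -3): 0.00, (1, 0): 0.05, (1, 3): 0.25,
}


def cov_corr(f1, f2):
    """Covariance and correlation of f1(X1) and f2(X2) under JOINT."""
    pts = [(f1(x1), f2(x2), p) for (x2, x1), p in JOINT.items()]
    m1 = sum(u * p for u, _, p in pts)
    m2 = sum(v * p for _, v, p in pts)
    cov = sum((u - m1) * (v - m2) * p for u, v, p in pts)
    var1 = sum((u - m1) ** 2 * p for u, _, p in pts)
    var2 = sum((v - m2) ** 2 * p for _, v, p in pts)
    return cov, cov / math.sqrt(var1 * var2)


a, b, c, d = 2.0, -3.0, 1.0, 0.5   # arbitrary illustrative constants

cov0, corr0 = cov_corr(lambda x: x, lambda x: x)
cov1, corr1 = cov_corr(lambda x: a + b * x, lambda x: c + d * x)

print(cov0, cov1, b * d * cov0)   # cov1 matches b * d * cov0
print(corr0, corr1)               # same magnitude, sign flipped because b * d < 0
```

The location shifts a and c drop out entirely, and only the sign of b times d affects the correlation.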


The fourth effect of linear transformations on covariance between random variables 
relates to coskewness and cokurtosis. 


Coskewness and cokurtosis are cross variable versions of skewness and kurtosis. 
Interpreting the meaning of coskewness and cokurtosis is not as clear as covariance. 
However, both coskewness and cokurtosis measure the direction of how one random 
variable raised to the first power is impacted when the other variable is raised to the 
second power. For example, stock returns for one variable and volatility of the returns 
for another variable tend to have negative coskewness. In this case, negative 
coskewness implies that one variable has a negative return when the other variable has 
high volatility. 


PROFESSOR'S NOTE
The concepts of coskewness and cokurtosis will be illustrated in the next
reading (Reading 16).


Variance of Weighted Sum of Bivariate Random Variables 


LO 15.g: Compute the variance of a weighted sum of two random variables. 


When measuring the variance of the sum of two random variables, the covariance or
comovement between the two variables is a key component. The variance of the sum of two
random variables, X1 and X2, is computed by summing the individual variances and two
times the covariance:
$$\text{Var}[X_1 + X_2] = \text{Var}[X_1] + \text{Var}[X_2] + 2\,\text{Cov}[X_1, X_2]$$
If a and b represent the weights of investment in assets X1 and X2, respectively, then the
variance of a two-asset portfolio is computed as follows:
$$\text{Var}[aX_1 + bX_2] = a^2\,\text{Var}[X_1] + b^2\,\text{Var}[X_2] + 2ab\,\text{Cov}[X_1, X_2]$$
In a two-asset portfolio context, this equation is most commonly written as:
$$\sigma_p^2 = w^2\sigma_1^2 + (1 - w)^2\sigma_2^2 + 2w(1 - w)\sigma_{12}$$
The minimum variance portfolio (i.e., optimal risk weight) can then be found as:


$$w^* = \frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 - 2\sigma_{12} + \sigma_2^2}$$
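
As a short derivation sketch (added for clarity and not part of the original reading), the minimum variance weight follows from setting the derivative of the two-asset portfolio variance with respect to w equal to zero. Here w is the weight in Asset 1, σ1² and σ2² are the two variances, and σ12 is the covariance.

```latex
\begin{aligned}
\sigma_p^2 &= w^2\sigma_1^2 + (1-w)^2\sigma_2^2 + 2w(1-w)\sigma_{12} \\
\frac{d\sigma_p^2}{dw} &= 2w\sigma_1^2 - 2(1-w)\sigma_2^2 + 2(1-2w)\sigma_{12} = 0 \\
w\left(\sigma_1^2 - 2\sigma_{12} + \sigma_2^2\right) &= \sigma_2^2 - \sigma_{12} \\
w^* &= \frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 - 2\sigma_{12} + \sigma_2^2}
\end{aligned}
```

The second derivative is 2(σ1² - 2σ12 + σ2²), which equals twice the variance of X1 - X2 and is therefore nonnegative, so the stationary point is a minimum whenever the two assets are not perfectly identical in their movements.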


EXAMPLE: Computing variance of a two-asset portfolio 


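Suppose two assets have a correlation of 0.30. Using the following covariance matrix,
compute the variance of a two-asset portfolio with 30% in Asset 1 and 70% in Asset 2.


$$\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} 18\%^2 & \rho \times 18\% \times 9\% \\ \rho \times 18\% \times 9\% & 9\%^2 \end{pmatrix}$$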


Answer: 


The variance of this two-asset portfolio is computed as: 


$$\sigma_p^2 = (0.30)^2(0.18)^2 + (0.70)^2(0.09)^2 + 2(0.30)(0.70)(0.30 \times 0.18 \times 0.09)$$
$$= (0.09)(0.0324) + (0.49)(0.0081) + 0.00204$$
$$= 0.00292 + 0.00397 + 0.00204 = 0.00893 \text{ or } 0.893\%$$


The standard deviation for this two-asset portfolio is 9.45%, which is found by 
taking the square root of the variance. 


Note that the optimal weight of Asset 1 with a correlation of 0.30 is approximately 
10.5%. The standard deviation of this minimum risk portfolio is approximately 8.8%. 
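
The figures in this example can be reproduced with a minimal Python sketch. The inputs (18% and 9% standard deviations, a correlation of 0.30, and a 30% weight in Asset 1) come from the example; the function names are assumptions made for illustration.

```python
import math


def portfolio_variance(w, sd1, sd2, rho):
    """Variance of a portfolio with weight w in Asset 1 and (1 - w) in Asset 2."""
    cov = rho * sd1 * sd2
    return w ** 2 * sd1 ** 2 + (1 - w) ** 2 * sd2 ** 2 + 2 * w * (1 - w) * cov


def min_variance_weight(sd1, sd2, rho):
    """Weight in Asset 1 that minimizes the two-asset portfolio variance."""
    cov = rho * sd1 * sd2
    return (sd2 ** 2 - cov) / (sd1 ** 2 - 2 * cov + sd2 ** 2)


var_p = portfolio_variance(0.30, 0.18, 0.09, 0.30)
sd_p = math.sqrt(var_p)

w_star = min_variance_weight(0.18, 0.09, 0.30)
sd_min = math.sqrt(portfolio_variance(w_star, 0.18, 0.09, 0.30))

print(round(var_p, 5), round(sd_p, 4))     # variance and standard deviation at w = 30%
print(round(w_star, 3), round(sd_min, 3))  # minimum variance weight and its standard deviation
```

Changing rho in the calls shows how the minimum variance weight and the resulting standard deviation move with the correlation.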


Figure 15.4 illustrates the impact of correlation on the standard deviation for a two- 
asset portfolio using the optimal (minimum risk) portfolio weight at different 
correlations between -1 and +1.


We can note a couple of observations from the graph in Figure 15.4. First, the standard
deviation is smallest with strong negative correlations. Second, the graph is
asymmetrical because the larger positive correlations result in higher standard
deviations (right-hand side of graph) than smaller negative correlations (left-hand side
of graph). This occurs because the optimal weight of Asset 1 in the
minimum risk portfolio is negative for the largest correlations, which results in the
second asset having a weight greater than 1. With high correlations, the
benefits of diversification are limited, and more exposure is concentrated in one asset.


Figure 15.4: Standard Deviation of Two-Asset Portfolio 

Conditional Expectations


LO 15.h: Compute the conditional expectation of a component of a bivariate 
random variable. 


In the context of portfolio risk management, a conditional expectation of a random 
variable is computed based on a specific event occurring. A conditional PMF is used to 
determine the conditional expectation based on weighted averages. A conditional 


distribution is defined based on the conditional probability for a bivariate random
variable X1 given X2.


Suppose a portfolio manager creates a conditional PMF based on earnings
announcements, X2. Earnings announcements can take on three possible outcomes: a
positive earnings surprise, X2 = 1; a negative earnings surprise, X2 = -1; or a neutral
earnings announcement, X2 = 0. We return to the previous example in Figure 15.2,
which is reproduced here as Figure 15.5.


Figure 15.5: PMF of Stock Returns, X1, Given Earnings Announcement, X2


                          Stock Return (X1)
Earnings (X2)         -3%     0%      3%      f_X2(x2)
Negative   -1         25%     15%     0%      40%
Neutral     0         5%      10%     15%     30%
Positive    1         0%      5%      25%     30%

f_X1(x1)              30%     30%     40%


The conditional distribution of X1 given X2 = -1 is summarized as follows:

Return         -3%      0%       3%
Probability    62.5%    37.5%    0.0%


The conditional expectation of the return given that the earnings announcement is
negative is then computed as:


$$E[X_1 \mid X_2 = -1] = -3\% \times 62.5\% + 0\% \times 37.5\% + 3\% \times 0\% = -1.875\%$$
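
The same calculation can be sketched in Python for every announcement outcome. The probabilities are those of Figure 15.5, with returns in percent; the function name conditional_expectation_x1 is an assumption. With the signs of the returns kept, the expectation given a negative announcement is -1.875%.

```python
# Joint PMF from Figure 15.5, keyed by (x2, x1): announcement, return in percent.
JOINT = {
    (-1, -3): 0.25, (-1, 0): 0.15, (-1, 3): 0.00,
    (0, -3): 0.05, (0, 0): 0.10, (0, 3): 0.15,
    (1, -3): 0.00, (1, 0): 0.05, (1, 3): 0.25,
}


def conditional_expectation_x1(joint, x2):
    """E[X1 | X2 = x2]: weight each return by its conditional probability."""
    row = {x1: p for (k2, x1), p in joint.items() if k2 == x2}
    marginal_x2 = sum(row.values())
    return sum(x1 * p / marginal_x2 for x1, p in row.items())


for outcome in (-1, 0, 1):
    print(outcome, conditional_expectation_x1(JOINT, outcome))
```

The conditional expectation rises from negative to positive as the announcement moves from negative to positive, which matches the positive dependence in the matrix.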


MODULE QUIZ 15.3


1. What is the variance of a two-asset portfolio given the following covariance matrix 
and a correlation between the two assets of 0.25? Assume the weights in Asset 1 and 
Asset 2 are 40% and 60%, respectively. 

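$$\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} 10\%^2 & \rho \times 10\% \times 4\% \\ \rho \times 10\% \times 4\% & 4\%^2 \end{pmatrix}$$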


A. 0.27%. 
B. 0.79%. 
C. 1.47%. 
D. 2.63%. 


2. Suppose a portfolio manager creates a conditional PMF based on analyst ratings, X2.
Analysts’ ratings can take on three possible outcomes: an upgrade, X2 = 1; a
downgrade, X2 = -1; or a neutral no change rating, X2 = 0. What is the conditional
expectation of a return given an analyst upgrade and the following conditional
distribution of X1 given X2 = 1?


Return         -4%      0%       4%
Probability    12.5%    23.5%    64.0%
A. 2.06%. 
B. 3.05%. 


C. 4.40%. 
D. 11.72%. 


MODULE 15.4: INDEPENDENT AND IDENTICALLY 
DISTRIBUTED RANDOM VARIABLES 


LO 15.i: Describe the features of an independent and identically distributed (iid) 
sequence of random variables. 


LO 15.j: Explain how the iid property is helpful in computing the mean and 
variance of a sum of iid random variables. 


Independent and identically distributed (i.i.d.) random variables are generated 
from a single univariate distribution such as the normal distribution. Features of i.i.d. 
sequence of random variables include the following: 

= Variables are independent of all other components. 

= Variables are all from a single univariate distribution. 

= Variables all have the same moments. 


= Expected value of the sum of n i.i.d. random variables is equal to nμ.


= Variance of the sum of n i.i.d. random variables is equal to nσ².


= Variance of the sum of i.i.d. random variables grows linearly. 


= Variance of the average of multiple i.i.d. random variables decreases as n increases. 


Determining the mean and variance of i.i.d. random variables is relatively easy because
the variables are independent and have the same moments. The expected value of the
sum of n i.i.d. random variables is simply equal to nμ. All i.i.d. random variables are
from the same univariate distribution and thus have the same mean, μ. The expectation
of a sum is always the sum of the expectations. In this case, we assume all variables are
identical. Thus, the sum of the means is simply a linear scale based on n as follows:


$$E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} \mu = n\mu$$


Similarly, the variance of the sum of n i.i.d. random variables is equal to nσ². This
result is only true if the variables are independent of each other in addition to
identical. This can be illustrated with the following equations.


The variance of a sum of n random variables is computed as:


$$\text{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \text{Var}[X_i] + 2\sum_{j=1}^{n}\sum_{k=j+1}^{n} \text{Cov}[X_j, X_k]$$


Because all variables are independent, the covariances of all variables must be equal to
zero.


This results in the second term of the previous equation dropping out, and we are
simply left with the sum of all variances. Since all i.i.d. random variables have the same
expected mean and variance, the variance of a sum of i.i.d. random variables is equal to
nσ².


$$\text{Var}\left[\sum_{i=1}^{n} X_i\right] = n\sigma^2$$


The variance of the sum of multiple i.i.d. random variables grows linearly based on n. Thus,
for two i.i.d. random variables, X1 and X2, the variance of the sum will be 2σ².


$$\text{Var}[X_1 + X_2] = 2\sigma^2$$


An important implication for this when estimating unknown parameters is that the
variance of the average reduces as n increases. In other words, with a larger n, the
expected average will be closer to the true unknown mean, μ.
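
These properties can be verified exactly for a small hypothetical case that is not from the reading: n independent rolls of a fair six-sided die. The sketch enumerates every combination, builds the PMF of the sum, and compares its mean and variance with nμ and nσ². Dividing the variance of the sum by n² gives the variance of the average, σ²/n.

```python
from fractions import Fraction
from itertools import product

# Hypothetical i.i.d. source: one roll of a fair die.
die = {face: Fraction(1, 6) for face in range(1, 7)}


def mean_var(pmf):
    """Mean and variance of a discrete PMF given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


def sum_pmf(pmf, n):
    """PMF of the sum of n independent draws from pmf, by full enumeration."""
    out = {}
    for combo in product(pmf.items(), repeat=n):
        total = sum(x for x, _ in combo)
        prob = Fraction(1)
        for _, p in combo:
            prob *= p          # independence: joint probability is the product
        out[total] = out.get(total, Fraction(0)) + prob
    return out


mu, sigma2 = mean_var(die)
for n in (1, 2, 3, 4):
    mean_n, var_n = mean_var(sum_pmf(die, n))
    # Columns: n, mean of sum, variance of sum, variance of the average.
    print(n, mean_n, var_n, var_n / n ** 2)
```

The variance of the sum grows linearly in n, while the variance of the average shrinks in proportion to 1/n.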


MODULE QUIZ 15.4


1. Which of the following statements regarding the sums of i.i.d. normal random 
variables is incorrect? 


A. The sums of i.i.d. normal random variables are normally distributed. 

B. The expected value of a sum of three i.i.d. random variables is equal to 3μ.
C. The variance of the sum of four i.i.d. random variables is equal to 6σ².

D. The variance of the sum of i.i.d. random variables grows linearly. 


2. The variance of the average of multiple i.i.d. random variables: 
A. increases as n increases. 
B. decreases as n increases. 
C. increases if the covariance is negative as n increases. 
D. decreases if the covariance is negative as n increases. 


KEY CONCEPTS 


LO 15.a 
A probability matrix of a discrete bivariate random variable distribution describes the
outcome probabilities as a function of the coordinates x1 and x2. All probabilities in the
matrix are positive or zero, are less than or equal to 1, and the sum across all possible
outcomes for X1 and X2 equals 1.


LO 15.b 

A marginal distribution defines the distribution of a single component of a bivariate 
random variable (i.e., a univariate random variable). A conditional distribution sums 
the probabilities of the outcomes for each component conditional on the other 
component being a specific value. 


LO 15.c 
The expectation of a bivariate random function g(X1, X2) is a probability-weighted
average of the function of the outcomes g(x1, x2).


LO 15.d 


Covariance is the expected value of the product of the deviations of the two random 
variables from their respective expected values. It measures how two variables move 
with each other. 


LO 15.e 


The correlation coefficient is a statistical measure that standardizes the covariance as
follows:


$$\text{Corr}(X_1, X_2) = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$$


LO 15.f 

The first effect of a linear transformation on the covariance of two random variables is
that b determines the correlation between the components. The correlation between X1
and X2 will be 1 if b > 0, 0 if b = 0, and -1 if b < 0.


A second effect of linear transformations on covariance is that the amount or scale of a
has no effect on the variance, and the scale of b determines the scale or changes in the
variance by b².


A third effect of linear transformations on covariance is that the scale of covariance is
determined by two variables, b and d, as follows:
$$\text{Cov}[a + bX_1, c + dX_2] = bd\,\text{Cov}[X_1, X_2]$$


The fourth effect of linear transformations on covariance between random variables 
relates to coskewness and cokurtosis. 


LO 15.g
The variance of a two-asset portfolio, X1 and X2, with weights of a and b, respectively is:


$$\text{Var}[aX_1 + bX_2] = a^2\,\text{Var}[X_1] + b^2\,\text{Var}[X_2] + 2ab\,\text{Cov}[X_1, X_2]$$


LO 15.h 


In the context of portfolio risk management, a conditional expectation of a random 
variable is computed based on a specific event occurring. A conditional distribution is 
defined based on the conditional probability for a bivariate random variable X1 given
X2.


LO 15.i 
Independent and identically distributed (i.i.d.) random variables 
= are independent of all other components, 


= are all from a single univariate distribution, and 


= all have the same moments. 


LO 15.j 
Features of the sum of n i.i.d. random variables include the following: 


= The expected value of the sum of n i.i.d. random variables is equal to nμ.


= The variance of the sum of n i.i.d. random variables is equal to nσ².


= The variance of the sum of i.i.d. random variables grows linearly. 


= The variance of the average of multiple i.i.d. random variables decreases as n 
increases. 


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 15.1 


1.D The marginal distribution for a positive analyst rating is computed by summing 
the third row consisting of all possible outcomes of a positive rating as follows: 
$$f_{X_2}(1) = 0\% + 5\% + 25\% = 30\%$$

(LO 15.b) 


2.A A conditional distribution is defined based on the conditional probability for a
bivariate random variable X1 given X2. All possible outcomes of a positive analyst
rating are found in the third row of the bivariate probability matrix (x2 = 1) as
0%, 5%, and 25% for monthly returns of -6%, 0%, and 6%, respectively. These
joint probabilities are then divided by the marginal probability of a positive
analyst rating, which is computed as 0% + 5% + 25% = 30%. Thus, the conditional
distribution for X2 = 1 is computed as 0% / 30%, 5% / 30%, and 25% / 30% and
summarized as follows:


Return         -6%     0%       6%
Probability    0%      16.7%    83.3%
(LO 15.b)


Module Quiz 15.2 
1.D The expectation is computed as follows: 


$$E[g(X_1, X_2)] = 3^2(0.35) + 3^4(0.20) + 6^2(0.15) + 6^4(0.30)$$


$$= 3.15 + 16.20 + 5.40 + 388.80 = 413.55$$
(LO 15.c) 


2.C Correlation will standardize the data and remove the difficulty in interpreting the
scale difference between variables. Correlation is determined by dividing the
covariance of the two variables by the product of the two variables’ standard
deviations. The formula for correlation is as follows:


$$\text{Corr}(X_1, X_2) = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$$


(LO 15.e) 


Module Quiz 15.3 
1.A The variance of this two-asset portfolio is computed as: 


$$\sigma_p^2 = (0.40)^2(0.10)^2 + (0.60)^2(0.04)^2 + 2(0.40)(0.60)(0.25 \times 0.10 \times 0.04)$$
$$= (0.16)(0.01) + (0.36)(0.0016) + 0.00048$$
$$= 0.00160 + 0.00058 + 0.00048 = 0.00266 \text{ or } 0.27\%$$
(LO 15.g) 


2.A The conditional expectation of the return given a positive analyst upgrade is 
computed as: 


$$E[X_1 \mid X_2 = 1] = -4\% \times 12.5\% + 0\% \times 23.5\% + 4\% \times 64.0\%$$


$$= -0.005 + 0.0 + 0.0256 = 0.0206 \text{ or } 2.06\%$$
(LO 15.h) 


Module Quiz 15.4 


1.C The variance of the sum of n i.i.d. random variables is equal to nσ². Thus, for four
i.i.d. random variables, the variance of the sum would be equal to 4σ². The
covariance terms are all equal to zero because all variables are independent. (LO
15.j)


2.B The variance of the average of multiple i.i.d. random variables decreases as n
increases. The covariance of i.i.d. random variables is always zero. (LO 15.j)


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 5. 


READING 16 
SAMPLE MOMENTS 


Study Session 5 


EXAM FOCUS 


This reading explains how sample moments (mean, variance, skewness, and kurtosis) 
are used to estimate the true population moments for data generated from independent 
and identically distributed (i.i.d.) random variables. For the exam, be able to estimate 
these sample moments and explain the differences from population moments. Also, be 
prepared to discuss what makes estimators biased, unbiased, and consistent. In 
addition, be able to discuss the law of large numbers (LLN) and the central limit 
theorem (CLT). Lastly, be prepared to contrast the advantages of estimating quantiles to 
traditional measures of dispersion. 


MODULE 16.1: ESTIMATING MEAN, VARIANCE, AND 
STANDARD DEVIATION 


LO 16.a: Estimate the mean, variance, and standard deviation using sample data. 


The sample mean, μ̂, is estimated by dividing the sum of all the values in a sample of a
population, ΣX, by the number of observations in the sample, n. It is used to make
inferences about the population mean. The sample mean is expressed as:


$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$$


The sum of the deviations of each observation in the data set from the mean is always
zero.


The arithmetic mean is the only measure of central tendency for which the sum of the
deviations from the mean is zero. Mathematically, this property can be expressed as
follows:


$$\text{sum of mean deviations} = \sum_{i=1}^{n} (X_i - \hat{\mu}) = 0$$


The deviations are squared to estimate the variance of the sample. The biased sample
estimator of variance for a sample of n i.i.d. random variables X_i is computed as:


$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \hat{\mu})^2$$


The square root of the variance is called the standard deviation. The variance and 
standard deviation measure the extent of the dispersion in the values of the random 
variable around the mean. 


EXAMPLE: Estimating the mean, variance, and standard deviation with sample 
data 


Assume you are evaluating the stock of Alpha Corporation. You have calculated the 
stock returns for Alpha Corporation over the last five years to develop the following 
sample data set. Given this information, calculate the sample mean, variance, and 
standard deviation. 


Data set: 24%, 34%, 18%, 54%, 10% 


Answer: 


$$\hat{\mu} = \text{sample mean} = \frac{0.24 + 0.34 + 0.18 + 0.54 + 0.10}{5} = 0.28 = 28.0\%$$


The calculation of the sample variance can be computed using the following table: 


X_i     Mean    Deviation    Squared Deviation
0.24    0.28    -0.04        0.0016
0.34    0.28     0.06        0.0036
0.18    0.28    -0.10        0.0100
0.54    0.28     0.26        0.0676
0.10    0.28    -0.18        0.0324

Sum                          0.1152


From the table, the first step is to compute the deviation from the mean. In the third 
column, the mean is subtracted from the observed value, X;. In the fourth column, the 
deviations from the mean in the third column are squared. The sum of all squared 
deviations is equal to 0.1152. This amount is then divided by the number of 
observations to compute the variance of 0.023 (= 0.1152/5). 


The biased standard deviation for this sample is then computed as: 


$$\sqrt{0.023} = 0.1517 \text{ or } 15.17\%$$


The calculations in the previous example result in a biased estimate of the variance and
standard deviation. Because the bias is known, the sum of squared deviations should be
divided by (n - 1) and not n when estimating the variance and standard deviation. (This
will be discussed later in this reading.)


Therefore, the unbiased estimate for variance is computed by dividing the sum of all 
squared deviations by (n - 1). 


Given the data in the previous example, this results in an unbiased estimate of 0.0288 
for the variance (= 0.1152/4). The unbiased estimate of the standard deviation is then 
0.1697 or 16.97%. 
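
The biased and unbiased estimates for the Alpha Corporation data can be reproduced with Python's standard statistics module, where pvariance divides by n and variance divides by (n - 1). This is only a sketch; the variable names are assumptions. Note that the biased standard deviation below is computed from the unrounded variance, so it differs slightly in the fourth decimal from a figure based on the rounded 0.023.

```python
import math
import statistics

# Alpha Corporation returns from the example.
returns = [0.24, 0.34, 0.18, 0.54, 0.10]

mean = statistics.fmean(returns)
biased_var = statistics.pvariance(returns)     # divides by n
unbiased_var = statistics.variance(returns)    # divides by n - 1

print(round(mean, 4))
print(round(biased_var, 5), round(math.sqrt(biased_var), 4))
print(round(unbiased_var, 4), round(math.sqrt(unbiased_var), 4))
```

The ratio of the two variance estimates is always n / (n - 1), so the gap between them narrows as the sample grows.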


PROFESSOR'S NOTE
Unless you are specifically instructed on the exam to compute a biased
variance, you should always compute the unbiased variance by dividing by
(n - 1).


Population and Sample Moments 


LO 16.b: Explain the difference between a population moment and a sample 
moment. 


Measures of central tendency identify the center, or average, of a data set. This central 
point can then be used to represent the typical, or expected, value in the data set. The 
first moment of the distribution of data is the mean. 


To compute the population mean, μ, all the observed values in the population are
summed and divided by the number of observations in the population, N. Note that the
population mean is unique in that a given population has only one mean. The
population mean is expressed as:


$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$$

The population mean is unknown because not all of the random numbers of the
population are observable. Therefore, we create samples of data to estimate the true
population mean. The hat notation above the μ denotes that the sample mean, μ̂, is an
estimate of the true mean.


$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$$


The sample mean is an estimate based on a known data set where all data points are 
observable. Thus, the sample mean is simply an estimate of the true population mean. 
Note the use of n, the sample size, versus N, the population size. 


The population mean and sample mean are both examples of arithmetic means. The 
arithmetic mean is the sum of the observed values divided by the number of 
observations. It is the most widely used measure of central tendency and has the 
following properties: 


= All interval and ratio data sets have an arithmetic mean.


= All data values are considered and included in the arithmetic mean computation. 
= A data set has only one arithmetic mean (i.e., the arithmetic mean is unique). 


The following example illustrates the difference between the sample mean and the 
population mean. 


EXAMPLE: Estimating the mean with different sample sizes 


Assume you and your research assistant are evaluating the stock of Beta 
Corporation. You have calculated the stock returns for Beta Corporation over the last 
12 years to develop the following data set. Your research assistant has decided to 
conduct his analysis using only the returns for the five most recent years, which are 
displayed as the bold numbers in the data set. Given this information, calculate the 
two sample means and discuss the population mean. 


Data set: 12%, 25%, 34%, 15%, 19%, 44%, 54%, 33%, 22%, 28%, 17%, 24% 
Answer: 
$$\hat{\mu} = \text{1st sample mean} = \frac{12 + 25 + 34 + 15 + 19 + 44 + 54 + 33 + 22 + 28 + 17 + 24}{12} = 27.25\%$$


and


$$\hat{\mu} = \text{2nd sample mean} = \frac{25 + 34 + 19 + 54 + 17}{5} = 29.8\%$$


The population mean is the expected value of all possible random returns for the 
company. Because all possible random observations of returns are not observable, 
sample estimates are used to estimate the true population mean. A larger sample 
size results in an estimate that is closer to the true unobservable population mean. 
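
A brief sketch of the two sample means follows. The 12 returns are the data set from the example and the five-value subset is the one used in the second calculation; the variable names are assumptions made for illustration.

```python
# Beta Corporation returns in percent (full 12-year sample).
returns = [12, 25, 34, 15, 19, 44, 54, 33, 22, 28, 17, 24]
# The five returns used by the research assistant in the second sample mean.
subset = [25, 34, 19, 54, 17]

mean_full = sum(returns) / len(returns)
mean_subset = sum(subset) / len(subset)

print(mean_full, mean_subset)
```

Both numbers are estimates of the same unobservable population mean; they differ only because they are computed from different samples.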


Unusually large or small values can have a disproportionate effect on the computed 
value for the arithmetic mean. For example, the mean of 1, 2, 3, and 50 is 14 and is not a 
good indication of what the individual data values really are. On the positive side, the 
arithmetic mean uses all the information available about the observations. The 
arithmetic mean of a sample from a population is the best estimate of both the true 
mean of the sample and the value of the next observation. 


Variance and Standard Deviation 


The mean and variance of a distribution are defined as the first moment and the second
central moment of the distribution, respectively. The variance of the mean estimator
can be calculated using the standard properties of random variables as the sum of the
variances and covariances:

$$\text{Var}[\hat{\mu}] = \text{Var}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n^2}\text{Var}\left[\sum_{i=1}^{n} X_i\right] = \frac{1}{n^2}\left[\sum_{i=1}^{n}\text{Var}(X_i) + \text{Covariances}\right]$$

All covariances are 0 because the $X_i$ are i.i.d. and therefore uncorrelated. This
results in the second term in the brackets dropping out. The variance of the mean
estimator then simplifies to:

$$\text{Var}[\hat{\mu}] = \frac{1}{n^2}\sum_{i=1}^{n}\text{Var}(X_i) = \frac{1}{n^2}\left(n\sigma^2\right) = \frac{\sigma^2}{n}$$

Thus, the variance of the mean estimator depends on the variance of the data and the
number of observations. If data is more variable, then it is more difficult to estimate
the true mean. The variance of the mean estimator will decrease when the size of the
sample (the number of observations) is increased. Therefore, a larger sample size helps
to reduce the difference between the estimated mean and the true mean of the
population.
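The result that the variance of the mean estimator equals the population variance divided by n can be verified exactly on a small case. The sketch below is only an illustration: it assumes a fair six-sided die as the population (not part of the reading) and lists every equally likely sample of size n, so no simulation error is involved. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Assumed population for illustration: a fair six-sided die.
faces = [1, 2, 3, 4, 5, 6]


def population_variance(values):
    """Variance of one draw when every value is equally likely."""
    mu = Fraction(sum(values), len(values))
    return sum((Fraction(v) - mu) ** 2 for v in values) / len(values)


def variance_of_sample_mean(values, n):
    """Exact variance of the mean of n i.i.d. draws, found by listing every sample."""
    means = [Fraction(sum(s), n) for s in product(values, repeat=n)]
    grand_mean = sum(means) / len(means)
    return sum((m - grand_mean) ** 2 for m in means) / len(means)


sigma2 = population_variance(faces)
for n in (1, 2, 3, 4):
    # The two printed fractions agree for every n.
    print(n, variance_of_sample_mean(faces, n), sigma2 / n)
```

Using exact fractions rather than floating-point numbers makes the equality hold exactly rather than approximately.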


The variance of a random variable is defined as:

$$\sigma^2 = \text{Var}[X] = E\left[(X - E[X])^2\right]$$

The population moments are transformed into estimator moments by replacing the
expectation operator of a random variable, E[·], with an averaging operator that divides
by the number of observations, n, such that:

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^2$$

Point Estimates and Estimators 


LO 16.c: Distinguish between an estimator and an estimate. 


Sample statistics can be used to draw conclusions about true population parameters,
which are unknown. Point estimates are single (sample) values used to estimate
population parameters, and the formula used to compute a point estimate is known as
an estimator. The hat notation, $\hat{\mu}$, in the following formula denotes that the
estimator formula of the mean is used to estimate the true unknown mean parameter, $\mu$.

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Observed sample data are then substituted for the random variables, $X_i$. The mean
estimator is a formula that transforms data into an estimate of the true population
mean using observed data from a sample of the population.


Biased Estimators 


LO 16.d: Describe the bias of an estimator and explain what the bias measures. 


The bias of an estimator measures the difference between the expected value of the
estimator, $E[\hat{\theta}]$, and the true population value, $\theta$. Therefore, the
estimator bias is computed as:

$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$

When the $X_i$ are i.i.d. random variables, the expected value of the mean estimator is
equal to the true population mean, $\mu$. The following equation illustrates that the
mean estimator bias is zero, because the expected mean estimator is equal to the true
population mean.

$$\text{Bias}(\hat{\mu}) = E[\hat{\mu}] - \mu = \mu - \mu = 0$$


Therefore, the sample mean is an unbiased estimator. Conversely, the sample variance is
a biased estimator. The bias of this estimator depends on the sample size n. The
expected sample variance, $E[\hat{\sigma}^2]$, is a function of the true population
variance, $\sigma^2$, and the number of observations, n:

$$E[\hat{\sigma}^2] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\sigma^2$$

The bias of the sample variance is then computed as:

$$\text{Bias}(\hat{\sigma}^2) = E[\hat{\sigma}^2] - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}$$

Thus, when the sample size n is large, the bias is small. The fact that the bias is known
allows us to determine an unbiased estimator for the population variance as:

$$s^2 = \frac{n}{n-1}\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \hat{\mu})^2$$

Note that based on the mathematical theory behind statistical procedures, the use of
the entire number of sample observations, n, instead of n - 1 as the divisor in the
computation of $s^2$, will systematically underestimate the population parameter,
$\sigma^2$, particularly for small sample sizes. This systematic underestimation causes
the sample variance to be a biased estimator of the population variance. Using n - 1
instead of n in the denominator, however, improves the statistical properties of $s^2$
as an estimator of $\sigma^2$. Thus, $s^2$, as expressed in the equation, is considered
to be an unbiased estimator of $\sigma^2$.
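The size of the bias can be confirmed exactly on a toy population. The following sketch is an illustration under stated assumptions: it uses a fair six-sided die (true variance 35/12) as the population and samples of size n = 3, neither of which comes from the reading, and the function name is hypothetical. It averages each variance estimator over every equally likely sample.

```python
from fractions import Fraction
from itertools import product

# Assumed population for illustration: a fair six-sided die.
faces = [1, 2, 3, 4, 5, 6]
true_var = Fraction(35, 12)  # population variance of a fair die


def expected_variance_estimate(values, n, divisor):
    """Average of sum((x - mean)^2) / divisor over all equally likely samples of size n."""
    total = Fraction(0)
    count = 0
    for sample in product(values, repeat=n):
        mean = Fraction(sum(sample), n)
        squared_devs = sum((Fraction(x) - mean) ** 2 for x in sample)
        total += squared_devs / divisor
        count += 1
    return total / count


n = 3
biased = expected_variance_estimate(faces, n, n)        # divides by n
unbiased = expected_variance_estimate(faces, n, n - 1)  # divides by n - 1

print("E[variance with divisor n]    =", biased)
print("E[variance with divisor n - 1] =", unbiased)
print("bias of divisor-n estimator   =", biased - true_var)
```

The divisor-n estimator averages to (n - 1)/n times the true variance, while the divisor-(n - 1) estimator averages to the true variance itself.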


Best Linear Unbiased Estimator 


LO 16.e: Explain what is meant by the statement that the mean estimator is BLUE. 


The best linear unbiased estimator (BLUE) is the best estimator of the population 
mean available because it has the minimum variance of any linear unbiased estimator. 
When data is i.i.d., the sample mean is considered to be BLUE. 


The following equation denotes how linear estimators of the mean are computed:

$$\hat{\mu} = \sum_{i=1}^{n} w_i X_i$$

where the $w_i$ are weights that do not depend on the $X_i$ (for the sample mean, $w_i = 1/n$).


Because the observations are equally likely, the weights are all equal to 1/n. An 
unbiased estimator is one for which the expected value of the estimator is equal to the 
parameter you are trying to estimate. For example, the sample mean is an unbiased 


estimator of the population mean, because the expected value of the sample mean is 
equal to the population mean. 


Note that there may be other nonlinear estimators that are better at estimating the 
true parameters of a distribution. For example, maximum likelihood estimators of the 
population mean may be more accurate. However, these estimators are nonlinear and 
are often biased in finite samples. 
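To see why equal weights give the minimum variance among linear unbiased estimators, note that for i.i.d. data with variance σ² the variance of a weighted sum is σ² times the sum of the squared weights, and unbiasedness requires the weights to sum to 1. The sketch below compares two such weight sets; the specific weights and the function name are assumptions chosen for illustration.

```python
def linear_estimator_variance(weights, sigma2):
    """Variance of sum(w_i * X_i) when the X_i are i.i.d. with variance sigma2."""
    return sigma2 * sum(w * w for w in weights)


sigma2 = 1.0  # assumed population variance, for illustration only

equal = [0.25, 0.25, 0.25, 0.25]   # the sample mean: w_i = 1/n
tilted = [0.40, 0.30, 0.20, 0.10]  # another set of weights that also sums to 1

var_equal = linear_estimator_variance(equal, sigma2)
var_tilted = linear_estimator_variance(tilted, sigma2)

# Both estimators are unbiased (weights sum to 1), but the equal-weight
# estimator has the smaller variance.
print(var_equal, var_tilted)
```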


MODULE QUIZ 16.1


1. A risk manager gathers the following sample data to analyze annual returns for an 
asset: 12%, 25%, and -1%. He wants to compute the best unbiased estimator of the 
true population mean and standard deviation. The manager's estimate of the standard 
deviation for this asset should be closest to: 

A. 0.0111. 
B. 0.0133. 
C. 0.1054. 
D. 0.1300. 


2. The sample mean is an unbiased estimator of the population mean because the: 
A. sampling distribution of the sample mean is normal. 
B. expected value of the sample mean is equal to the population mean. 
C. sample mean provides a more accurate estimate of the population mean as the 
sample size increases. 
D. sampling distribution of the sample mean has the smallest variance of any other 
unbiased estimators of the population mean. 


MODULE 16.2: ESTIMATING MOMENTS OF THE 
DISTRIBUTION 


LO 16.f: Describe the consistency of an estimator and explain the usefulness of 
this concept. 


LO 16.g: Explain how the Law of Large Numbers (LLN) and Central Limit 
Theorem (CLT) apply to the sample mean. 


Law of Large Numbers 

If the law of large numbers (LLN) applies to an estimator, then the estimator is
consistent. The first property of a consistent estimator is that, as the sample size
increases, the finite sample bias is reduced to zero. The second property of a consistent
estimator is that, as the sample size increases, the variance of the estimator
approaches zero. The properties of consistency ensure that estimates from large samples
have small deviations from the true population value. This is an important concept that
ensures that the estimates of the mean and variance will be very close to the true mean
and variance of the population in large samples. Thus, increasing the sample size
results in better estimates of the true population distribution.


Central Limit Theorem 

The central limit theorem (CLT) states that for simple random samples of size n from
a population with a mean $\mu$ and a finite variance $\sigma^2$, the sampling distribution
of the sample mean, $\hat{\mu}$, approaches a normal probability distribution with mean
$\mu$ and variance equal to $\sigma^2/n$ as the sample size becomes large. Relative to the
LLN, the CLT requires only one additional assumption: that the variance is finite. The
LLN only requires the assumption that the mean is finite. In addition, the CLT does not
require assumptions about the distribution of the random variables of the population. No
assumption regarding the underlying distribution of the population is necessary
because, when the sample size is large, the sums of i.i.d. random variables (the
individual items drawn for the sample) will be approximately normally distributed.


The CLT is extremely useful because the normal distribution is easily applied in testing
hypotheses and constructing confidence intervals. Specific inferences about the
population mean can be made from the sample mean, regardless of the population's
distribution, as long as the sample size is sufficiently large, which usually means n ≥ 30.
As the sample size increases, the sampling distribution of the sample mean is more
closely approximated by a normal distribution.


Important properties of the central limit theorem include the following: 


■ If the sample size n is sufficiently large, the sampling distribution of the sample
means will be approximately normal. Remember what's going on here: random
samples of size n are repeatedly being taken from an overall larger population. Each
of these random samples has its own mean, which is itself a random variable, and this
set of sample means has a distribution that is approximately normal.

■ The mean of the population, $\mu$, and the mean of the distribution of all possible
sample means are equal.

■ The variance of the distribution of sample means is $\sigma^2/n$, the population variance
divided by the sample size. Thus, it approaches zero as the sample size increases.
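These three properties can be illustrated with a small simulation. The sketch below is not from the reading: it assumes an exponential population with mean 1 and variance 1 (chosen because it is strongly right-skewed), a sample size of 50, and 5,000 repeated samples, with a fixed seed so the run is repeatable. The results are approximate by construction.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

n = 50            # observations per sample
n_samples = 5000  # number of repeated samples

# Assumed population: Exponential(1), with mean 1 and variance 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)    # should be near mu = 1
var_of_means = statistics.pvariance(sample_means)  # should be near sigma^2 / n = 0.02

# Under normality, about 95% of sample means fall within 1.96 standard errors of mu.
standard_error = (1.0 / n) ** 0.5
share_within = sum(
    abs(m - 1.0) <= 1.96 * standard_error for m in sample_means
) / n_samples

print(mean_of_means, var_of_means, share_within)
```

Even though individual exponential draws are far from normal, the distribution of the sample means is centered on the population mean, has variance close to σ²/n, and places roughly 95% of its mass within 1.96 standard errors.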


Skewness and Kurtosis 


LO 16.h: Estimate and interpret the skewness and kurtosis of a random variable. 


The skewness statistic is the standardized third central moment of the distribution.
Skewness (sometimes called relative skewness) refers to the extent to which the
distribution of data is not symmetric around its mean. It is calculated as:

$$\text{Skewness}(X) = \frac{E[(X - E[X])^3]}{E[(X - E[X])^2]^{3/2}} = \frac{\mu_3}{\sigma^3}$$

The estimator for the third moment is computed as:

$$\hat{\mu}_3 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^3$$


Nonsymmetrical distributions may be either positively or negatively skewed and result
from the occurrence of outliers in the data set. Outliers are observations with
extraordinarily large values, either positive or negative.

■ A positively skewed distribution is characterized by many outliers in the upper
region, or right tail. A positively skewed distribution is said to be skewed right
because of its relatively long upper (right) tail.

■ A negatively skewed distribution has a disproportionately large number of outliers
that fall within its lower (left) tail. A negatively skewed distribution is said to be
skewed left because of its long lower tail.


Figure 16.1 illustrates that skewness affects the location of the mean, median, and mode 
of a distribution. The mean is the arithmetic average, the median is the middle of the 
ranked data in order, and the mode is the most probable outcome. 

■ For a symmetrical, unimodal distribution, the mean, median, and mode are equal.

■ For a positively skewed, unimodal distribution, the mode is less than the median,
which is less than the mean. The mean is affected by outliers; in a positively skewed
distribution, there are large, positive outliers that will tend to pull the mean upward,
or more positive. An example of a positively skewed distribution is that of housing
prices. Suppose you live in a neighborhood with 100 homes; 99 of them sell for
$100,000, and one sells for $1 million. The median and the mode will be $100,000,
but the mean will be $109,000. Hence, the mean has been pulled upward (to the right)
by the existence of one home (outlier) in the neighborhood.

■ For a negatively skewed, unimodal distribution, the mean is less than the median,
which is less than the mode. In this case, there are large, negative outliers that tend to
pull the mean downward (to the left).
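The housing-price illustration above can be reproduced with Python's standard `statistics` module. This is only a quick check of the arithmetic; the variable names are illustrative.

```python
import statistics

# 99 homes at $100,000 and one home at $1,000,000.
prices = [100_000] * 99 + [1_000_000]

mean_price = statistics.mean(prices)      # pulled upward by the single outlier
median_price = statistics.median(prices)  # unaffected by the outlier
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The ordering mode = median < mean is what the text describes for a positively skewed distribution.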


Figure 16.1: Effect of Skewness on Mean, Median, and Mode 

The kurtosis statistic is the standardized fourth central moment of the distribution.
Kurtosis refers to how fat or thin the tails are in the data distribution and is calculated
as:

$$\text{Kurtosis}(X) = \frac{E[(X - E[X])^4]}{E[(X - E[X])^2]^{2}} = \frac{\mu_4}{\sigma^4}$$

The estimator for the fourth moment is computed as:

$$\hat{\mu}_4 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^4$$


Kurtosis for the normal distribution equals 3. Distributions with a kurtosis greater than 
3 are referred to as heavy-tailed or fat-tailed. Many software packages report excess 
kurtosis for any distribution as excess kurtosis = kurtosis - 3. Thus, a normal 
distribution has excess kurtosis equal to zero. 
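A minimal sketch of the skewness and kurtosis estimators follows. It assumes that each central moment is estimated by averaging over n (software packages often apply small-sample adjustments, so their output may differ slightly), and the function names and sample data are hypothetical.

```python
def central_moment(data, k):
    """k-th sample central moment, dividing by n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n


def skewness(data):
    """Standardized third central moment."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data):
    """Standardized fourth central moment (equals 3 for a normal distribution)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


def excess_kurtosis(data):
    """Kurtosis minus 3, so a normal distribution has excess kurtosis of 0."""
    return kurtosis(data) - 3.0


symmetric = [1, 2, 3, 4, 5]      # symmetric around 3
right_skewed = [1, 1, 1, 1, 10]  # one large positive outlier

print(skewness(symmetric), kurtosis(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), kurtosis(right_skewed))
```

The symmetric data set has zero skewness, while the data set with a single large positive value has positive skewness.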


Figure 16.2 illustrates that relative to a normal distribution, a leptokurtic distribution
will have a greater percentage of extremely large deviations from the mean (i.e., fat
tails). This means there is a relatively greater probability of an observed value being far
from the mean. With regard to an investment returns distribution, a greater likelihood
of a large deviation from the mean return is often perceived as an increase in risk. Note
that a distribution that has thinner tails than a normal distribution is referred to as a
platykurtic distribution.


Figure 16.2: Kurtosis 


Kurtosis is critical in a risk management setting. Most research about the distribution
of securities returns has shown that returns are not normally distributed. Actual
securities returns tend to exhibit both skewness and excess kurtosis. Skewness and
kurtosis are critical concepts for risk management because when securities returns are
modeled using an assumed normal distribution, the predictions from the models will not
take into account the potential for extremely large, negative outcomes. In fact, most risk
managers put very little emphasis on the mean and standard deviation of a distribution
and focus more on the distribution of returns in the tails of the distribution, because
that is where the risk is. In general, greater positive excess kurtosis and more negative
skew in return distributions indicate increased risk.


Median and Quantile Estimates 


LO 16.i: Use sample data to estimate quantiles, including the median. 


The median is the 50th percentile or midpoint of a data set when the data is arranged
in ascending or descending order. It is similar to the mean because both measure the
central tendency of the data. If the data is symmetrical, then the mean and median are
the same, with half the observations lying above the median and half below.


The median is important because the arithmetic mean can be affected by extremely 
large or small values (outliers). When this occurs, the median is a better measure of 
central tendency than the mean because it is not affected by extreme values that may 
possibly be errors in the data. 


Estimating Quantiles 

To determine the median and other quantiles, arrange the data from the highest to the 
lowest value, or lowest to highest value, and find the middle observation. The middle of 
the observations will depend on whether the total sample size is an odd or even 
number. 


The median is estimated when the total number of observations in the sample, n, is
odd as:

$$\text{Median}(x) = x_{(n+1)/2}$$

The median is estimated when the total number of observations in the sample is
even as:

$$\text{Median}(x) = 0.5 \times \left(x_{n/2} + x_{n/2+1}\right)$$


EXAMPLE: Estimating the median using an odd number of observations 
What is the median return for five portfolio managers with 10-year annualized total 
returns of: 30%, 15%, 25%, 21%, and 23%? 
Answer: 
First, arrange the returns in descending order. 
30%, 25%, 23%, 21%, 15% 


Then, select the observation that has an equal number of observations above and 
below it—the one in the middle. For the given data set, the third observation, 23%, is 
the median value. 


EXAMPLE: Estimating the median using an even number of observations 


Suppose we add a sixth manager to the previous example with a return of 28%. What 
is the median return? 


Answer: 
Arranging the returns in descending order gives us: 
30%, 28%, 25%, 23%, 21%, 15% 


With an even number of observations, there is no single middle value. The median 
value in this case is the arithmetic mean of the two middle observations, 25% and 
23%. Thus, the median return for the six managers is 24.0% = 0.5(25 + 23). 
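The odd and even cases in the two examples can be sketched in a few lines of Python. The function below is an illustrative implementation of the rule described above (Python's built-in `statistics.median` gives the same results); it sorts in ascending order, which does not change the middle value.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the observation at 1-based position (n + 1) / 2.
        return ordered[mid]
    # Even n: average the observations at 1-based positions n/2 and n/2 + 1.
    return 0.5 * (ordered[mid - 1] + ordered[mid])


five_managers = [30, 15, 25, 21, 23]  # returns in %
six_managers = five_managers + [28]   # add the sixth manager

print(median(five_managers), median(six_managers))
```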


Estimating Quartiles 

In addition to the median, the two most commonly reported quantiles are the 25% and
75% quantiles (the first and third quartiles). The estimation procedure for these
quantiles is similar to the median process. The data is first sorted and then the
α-quantile is estimated using the data point in location α × n. If α × n is not an
integer, then the general rule is to average the points immediately above and below
α × n.


An interquartile range (IQR) is a measure of dispersion around the median, similar to
the way standard deviation measures dispersion around the mean. A common IQR is the
range from the 25% quantile to the 75% quantile. These measures are useful in
determining the symmetry of the distribution and weight of the tails.
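The quantile rule described above can be sketched as follows. This is one reading of the rule (sorted in ascending order, 1-based positions), and the function name is hypothetical; statistical packages use a variety of interpolation conventions, so their quartiles may differ. The data are the 12 Beta Corporation returns from the earlier example.

```python
import math


def quantile(values, alpha):
    """Estimate the alpha-quantile using the observation at 1-based position alpha * n.

    If alpha * n is not an integer, average the observations immediately
    below and above that position.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(values)
    n = len(ordered)
    position = alpha * n
    lower = max(math.floor(position), 1)  # never go below the first observation
    upper = math.ceil(position)
    return 0.5 * (ordered[lower - 1] + ordered[upper - 1])


returns = [12, 25, 34, 15, 19, 44, 54, 33, 22, 28, 17, 24]  # in %

q25 = quantile(returns, 0.25)
q75 = quantile(returns, 0.75)
iqr = q75 - q25
print(q25, q75, iqr)
```

With 12 observations, 0.25 × 12 = 3 and 0.75 × 12 = 9 are both integers, so the quartiles are simply the third and ninth sorted observations.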


There are two properties of quantiles that make them valuable in data analysis: 


■ The interpretation of the quantiles is easy because they have the same units as the
sample data. For example, roughly 25% of the observations fall at or below the 25%
quantile.


■ Quantiles are robust to outliers or extreme values. In other words, the median and
the IQR are largely unaffected by outliers. Conversely, the mean is impacted by
outliers.


Mean of Two Random Variables 


LO 16.j: Estimate the mean of two variables and apply the CLT. 


The mean of two random variables is estimated the same way as the mean for
individual variables. The arithmetic average of the sample is determined by adding up
all values and dividing by the number of observations in the sample, n. Thus, the
formulas for estimating the means of two random variables, $X_i$ and $Y_i$, are:

$$\hat{\mu}_X = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad \hat{\mu}_Y = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

If the data is i.i.d., then the CLT applies to both estimators. If the two mean estimators
are considered as a bivariate mean estimate, $\hat{\mu}$, we can apply the CLT and examine
the joint behavior by stacking the two mean estimators into a vector:

$$\hat{\mu} = \begin{bmatrix} \hat{\mu}_X \\ \hat{\mu}_Y \end{bmatrix}$$

If the multivariate random variable Z = [X, Y] is i.i.d., then the 2-by-1 vector of mean
estimators is asymptotically normally distributed (i.e., the estimator converges to the
normal distribution as sample size increases).


Covariance and Correlation Between Random Variables 


LO 16.k: Estimate the covariance and correlation between two random variables. 


The covariance between two random variables is a statistical measure of the degree to 
which the two variables move together. The covariance captures the linear relationship 
between one variable and another. A positive covariance indicates that the variables 
tend to move together; a negative covariance indicates that the variables tend to move 
in opposite directions. Because we will be mostly concerned with the covariance of 
asset returns, the following formula has been written in terms of the covariance of the 
return of asset X, and the return of asset Y: 


$$\text{Cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\}$$

This equation simplifies to:

$$\text{Cov}(X, Y) = E(XY) - E(X) \times E(Y)$$

The sample covariance estimator can be calculated as:

$$\text{sample Cov}_{XY} = \frac{\sum_{i=1}^{n}(X_i - \hat{\mu}_X)(Y_i - \hat{\mu}_Y)}{n - 1}$$


EXAMPLE: Covariance 


Assume that the economy can be in three possible states (S) next year: boom,
normal, or slow economic growth. An expert source has calculated that P(boom) =
0.30, P(normal) = 0.50, and P(slow) = 0.20. The returns for Stock A, R_A, and Stock B,
R_B, under each of the economic states are provided in the following table. What is
the covariance of the returns for Stock A and Stock B?


Answer:

First, the expected returns for each of the stocks must be determined.
E(R_A) = (0.3)(0.20) + (0.5)(0.12) + (0.2)(0.05) = 0.13
E(R_B) = (0.3)(0.30) + (0.5)(0.10) + (0.2)(0.00) = 0.14


The covariance can now be computed using the procedure described in the following 
table. 


Covariance Computation

Event    P(S)   R_A    R_B    P(S) x [R_A - E(R_A)] x [R_B - E(R_B)]
Boom     0.3    0.20   0.30   (0.3)(0.20 - 0.13)(0.30 - 0.14) = 0.00336
Normal   0.5    0.12   0.10   (0.5)(0.12 - 0.13)(0.10 - 0.14) = 0.00020
Slow     0.2    0.05   0.00   (0.2)(0.05 - 0.13)(0.00 - 0.14) = 0.00224

Cov(R_A, R_B) = Σ P(S) x [R_A - E(R_A)] x [R_B - E(R_B)] = 0.00580
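The probability-weighted covariance in this example can be reproduced with a short Python sketch. The function and variable names are illustrative; the inputs are the state probabilities and returns given in the example.

```python
# State probabilities and returns from the example (boom, normal, slow).
probs = [0.30, 0.50, 0.20]
r_a = [0.20, 0.12, 0.05]  # Stock A returns in each state
r_b = [0.30, 0.10, 0.00]  # Stock B returns in each state


def expected_value(p, x):
    """Probability-weighted average of the outcomes."""
    return sum(pi * xi for pi, xi in zip(p, x))


def covariance(p, x, y):
    """Probability-weighted average of the products of deviations from the means."""
    ex, ey = expected_value(p, x), expected_value(p, y)
    return sum(pi * (xi - ex) * (yi - ey) for pi, xi, yi in zip(p, x, y))


e_a = expected_value(probs, r_a)
e_b = expected_value(probs, r_b)
cov_ab = covariance(probs, r_a, r_b)
print(e_a, e_b, cov_ab)
```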


The actual value of the covariance is not very meaningful because its measurement is
extremely sensitive to the scale of the two variables. Also, the covariance may range
from negative to positive infinity and it is presented in terms of squared units (e.g.,
percent squared). For these reasons, we take the additional step of calculating the
correlation coefficient, which converts the covariance into a measure that is easier to
interpret:

$$\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma(X)\,\sigma(Y)}$$


EXAMPLE: Correlation 

Using our previous example, compute and interpret the correlation of the returns
for Stocks A and B, given that σ²(R_A) = 0.0028 and σ²(R_B) = 0.0124 and recalling that
Cov(R_A, R_B) = 0.0058.


Answer:
First, it is necessary to convert the variances to standard deviations.

σ(R_A) = (0.0028)^(1/2) = 0.0529
σ(R_B) = (0.0124)^(1/2) = 0.1114

Now, the correlation between the returns of Stock A and Stock B can be computed as
follows:

$$\text{Corr}(R_A, R_B) = \frac{0.0058}{(0.0529)(0.1114)} = 0.9842$$

A correlation this close to +1 indicates a strong positive linear relationship between
the returns of the two stocks.
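As a cross-check, the variances, standard deviations, and correlation can be computed directly from the scenario data in the covariance example. This sketch uses unrounded standard deviations, so its result can differ from the hand calculation in the fourth decimal place. The names are illustrative.

```python
import math

# State probabilities and returns from the covariance example.
probs = [0.30, 0.50, 0.20]
r_a = [0.20, 0.12, 0.05]
r_b = [0.30, 0.10, 0.00]


def expected_value(p, x):
    return sum(pi * xi for pi, xi in zip(p, x))


def covariance(p, x, y):
    ex, ey = expected_value(p, x), expected_value(p, y)
    return sum(pi * (xi - ex) * (yi - ey) for pi, xi, yi in zip(p, x, y))


var_a = covariance(probs, r_a, r_a)  # variance is the covariance of a variable with itself
var_b = covariance(probs, r_b, r_b)
cov_ab = covariance(probs, r_a, r_b)

# Correlation: covariance divided by the product of the standard deviations.
corr_ab = cov_ab / (math.sqrt(var_a) * math.sqrt(var_b))
print(var_a, var_b, corr_ab)
```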


Coskewness and Cokurtosis 


LO 16.l: Explain how coskewness and cokurtosis are related to skewness and
kurtosis. 


Previously, the first and second moments (mean and variance) were applied to pairs of
random variables. We can also apply techniques to identify the third and fourth
moments for pairs of random variables that are similar to the measurements of
skewness and kurtosis for individual variables. The third cross central moment is
known as coskewness and the fourth cross central moment is known as cokurtosis.


There are p - 1 measures required for computing the pth cross moment. Figure 16.3
summarizes the number of measures required for each cross moment.


Figure 16.3: Cross Moment Measurements

Cross Moment   Number of Measurements
1st            0   cross means
2nd            1   covariance (cross variance)
3rd            2   coskewness (cross skewness)
4th            3   cokurtosis (cross kurtosis)


Dividing by the variance of one variable and the standard deviation of the other
variable standardizes the cross third moment. The two coskewness measures are
computed as:

$$s(X, X, Y) = \frac{E[(X - E[X])^2 (Y - E[Y])]}{\sigma_X^2 \sigma_Y}$$

$$s(X, Y, Y) = \frac{E[(X - E[X])(Y - E[Y])^2]}{\sigma_X \sigma_Y^2}$$

Coskewness measures the likelihood of large directional movements occurring for one
variable when the other variable is large. Coskewness measures are zero when there is
no relationship between the sign of one variable and large moves in the other
variable. Coskewness is always zero for a bivariate normal distribution because the
distribution is symmetrical.


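We can estimate coskewness by replacing the expectation operator with a sample average
as follows:

$$\hat{s}(X, X, Y) = \frac{\frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu}_X)^2 (Y_i - \hat{\mu}_Y)}{\hat{\sigma}_X^2 \hat{\sigma}_Y}$$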


The three cokurtosis measures are computed as:

$$k(X, X, Y, Y) = \frac{E[(X - E[X])^2 (Y - E[Y])^2]}{\sigma_X^2 \sigma_Y^2}$$

$$k(X, X, X, Y) = \frac{E[(X - E[X])^3 (Y - E[Y])]}{\sigma_X^3 \sigma_Y}$$

$$k(X, Y, Y, Y) = \frac{E[(X - E[X])(Y - E[Y])^3]}{\sigma_X \sigma_Y^3}$$


The cokurtosis is computed using combinations of powers that add to 4. Note that the 
first cokurtosis measurement k(X,X,Y,Y) is for the symmetrical case where there are two 
measurements from each variable (2,2). The asymmetric configurations are (1,3) and 
(3,1) where one of the variables measures to the third power and the other to the first 
power. 


The symmetrical case provides the sensitivity of the magnitude of one series to the 
magnitude of the other series. The cokurtosis measure will be large if both series are 
large in magnitude at the same time. The other two asymmetrical cases indicate the 
agreement of the return signs when the power 3 return is large in magnitude. 
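The coskewness and cokurtosis measures can all be estimated as sample averages of products of standardized deviations. The sketch below is an illustration, assuming moments are averaged over n and standard deviations are computed with the same divisor; the function names and sample data are hypothetical.

```python
def standardize(data):
    """Subtract the sample mean and divide by the (divisor-n) standard deviation."""
    n = len(data)
    mean = sum(data) / n
    sd = (sum((v - mean) ** 2 for v in data) / n) ** 0.5
    return [(v - mean) / sd for v in data]


def cross_moment(x, y, power_x, power_y):
    """Sample average of z_x**power_x * z_y**power_y for standardized z_x, z_y."""
    zx, zy = standardize(x), standardize(y)
    return sum(a ** power_x * b ** power_y for a, b in zip(zx, zy)) / len(x)


def coskewness(x, y):
    """Returns (s(X,X,Y), s(X,Y,Y)): powers (2,1) and (1,2)."""
    return cross_moment(x, y, 2, 1), cross_moment(x, y, 1, 2)


def cokurtosis(x, y):
    """Returns (k(X,X,Y,Y), k(X,X,X,Y), k(X,Y,Y,Y)): powers (2,2), (3,1), (1,3)."""
    return (
        cross_moment(x, y, 2, 2),
        cross_moment(x, y, 3, 1),
        cross_moment(x, y, 1, 3),
    )


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]  # hypothetical paired observations

print(coskewness(x, y))
print(cokurtosis(x, y))
```

When the same series is passed for both arguments, the cross moments reduce to the ordinary skewness and kurtosis of that series, which is one way to see how the cross moments generalize the single-variable measures.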


The cokurtosis of a bivariate normal depends on the correlation. Figure 16.4 illustrates
the relationship between cokurtosis and correlation for normal data and the symmetric
case, k(X,X,Y,Y). Notice that the correlation ranges between -1 and +1 and the
cokurtosis ranges between 1 and 3, with the smallest value of 1 occurring when the
correlation is equal to zero. When the correlation is zero, the two series are
independent of one another, because uncorrelated jointly normal random variables are
independent. The cokurtosis then goes up symmetrically the further the correlation is
away from zero.


Figure 16.4: Cokurtosis and Correlation for Symmetric Case 

Figure 16.5 illustrates the relationship between cokurtosis and correlation for normal
data and the asymmetrical cases where one series is to the power of three and the 
other is to the first power. The asymmetric cokurtosis ranges from -3 to +3 and is a 
linear relationship that is upward sloping as the correlation increases from -1 to +1. 
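
For reference, the shapes described for Figures 16.4 and 16.5 agree with the closed-form cokurtosis of a bivariate normal pair with correlation ρ. These expressions are not stated in the reading; they are a standard result for jointly normal variables and are offered here only as a cross-check:

```latex
k(X,X,Y,Y) = 1 + 2\rho^2, \qquad k(X,X,X,Y) = k(X,Y,Y,Y) = 3\rho
```

Setting ρ = 0 gives the minimum symmetric cokurtosis of 1, ρ = ±1 gives 3, and the asymmetric measures move linearly from -3 to +3 as ρ goes from -1 to +1.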


Figure 16.5: Cokurtosis and Correlation for Asymmetric Cases
(y-axis: cokurtosis; x-axis: correlation)


MODULE QUIZ 16.2


1. A junior analyst is assigned to estimate the first and second moments for an
investment. Sample data was gathered that is assumed to represent the random data
of the true population. Which of the following statements best describes the
assumptions that are required to apply the central limit theorem (CLT) in estimating
moments of this data set?

A. Only the variance is finite. 

B. Both the mean and variance are finite. 

C. The random variables are normally distributed. 

D. The mean is finite and the random variables are normally distributed. 


2. A distribution of returns that has a greater percentage of extremely large 
deviations from the mean: 


A. is positively skewed. 
B. is asymmetric distribution. 
C. has positive excess kurtosis. 


D. has negative excess kurtosis. 

3. The correlation of returns between Stocks A and B is 0.50. The covariance between 
these two securities is 0.0043, and the standard deviation of the return of Stock B 
is 26%. The variance of returns for Stock A is: 

A. 0.0331. 
B. 0.0011. 
C. 0.2656. 
D. 0.0112. 


4. Consider the following probability matrix: 


Probability Matrix

Returns        R_B = 50%   R_B = 20%   R_B = -30%
R_A = -10%     40%         0%          0%
R_A = 10%      0%          30%         0%
R_A = 30%      0%          0%          30%

The covariance between Stock A and B is closest to:
A. -0.160. 

B. -0.055. 

C. 0.004. 

D. 0.020. 


5. An analyst is graphing the cokurtosis and correlation for a pair of bivariate random 
variables that are normally distributed. For the symmetrical case of the three 
cokurtosis measures, k(X,X,Y,Y), cokurtosis is graphed on the y-axis and correlation 
is graphed on the x-axis between -1 and +1. The shape of this graph should be best 
described as a(n): 

A. upward linear graph ranging in cokurtosis values between -3 and +3. 

B. downward linear graph ranging in cokurtosis values between -1 and +1. 

C. symmetrical curved graph with the maximum cokurtosis of 3 when the correlation
is 0.

D. symmetrical curved graph with the minimum cokurtosis of 1 when the correlation
is 0.


KEY CONCEPTS 


LO 16.a 


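The sample mean, $\hat{\mu}$, and sample variance, $\hat{\sigma}^2$, for a sample of n
independent and identically distributed (i.i.d.) random variables $X_i$ are computed as:

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat{\mu})^2$$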


LO 16.b 


The sample mean is an estimator based on a known data set where all data points are 
observable. It is only an estimate of the true population mean. 


LO 16.c 


Point estimates are single (sample) values used to estimate population parameters, and 
the formula used to compute a point estimate is known as an estimator. 


LO 16.d 
The bias of an estimator measures the difference between the expected value of the 
estimator and the true population value: 


$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$


LO 16.e 


The best linear unbiased estimator (BLUE) is the best estimator of the population mean 
available because it has the minimum variance of any linear unbiased estimator. 


LO 16.f 


A consistent estimator is one for which, as the sample size increases, the finite sample
bias is reduced to zero and the variance of the estimator approaches zero.


LO 16.g 

The law of large numbers (LLN) implies that estimators converge to the true population
value as the sample size increases (i.e., an average of many observations converges to
its expected value). The central limit theorem (CLT) states that when the sample size is
large, the sums of i.i.d. random variables will be approximately normally distributed.


LO 16.h 

Skewness is the standardized third central moment of a distribution and refers to the
extent to which the distribution of data is not symmetric around its mean. Kurtosis is
the standardized fourth central moment of a distribution and refers to how fat or thin
the tails are in the distribution of data.


LO 16.i 
The median calculation with an odd number sample size is:

$$\text{Median}(x) = x_{(n+1)/2}$$

The median calculation with an even number sample size is:

$$\text{Median}(x) = (1/2)\left(x_{n/2} + x_{n/2+1}\right)$$

LO 16.j 

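The formulas for estimating the means of two random variables, $X_i$ and $Y_i$, are:

$$\hat{\mu}_X = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad \hat{\mu}_Y = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

For i.i.d. data, we can apply the CLT and examine the joint behavior by stacking the two
mean estimators into a vector:

$$\hat{\mu} = \begin{bmatrix} \hat{\mu}_X \\ \hat{\mu}_Y \end{bmatrix}$$
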

LO 16.k 


Covariance measures the extent to which two random variables tend to be above and 
below their respective means for each joint realization. It can be calculated as: 


$$\text{Cov}(X, Y) = E\{[X - E(X)][Y - E(Y)]\}$$

Correlation is a standardized measure of association between two random variables; it
ranges in value from -1 to +1 and is equal to:

$$\text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma(X)\,\sigma(Y)}$$

LO 16.l 


Coskewness measures the likelihood of large directional movements occurring for one
variable when the other variable is large. Coskewness is zero when there is no
relationship between the sign of one variable and large moves in the other variable.


The cokurtosis of a bivariate normal depends on the correlation. Cokurtosis for the
symmetric case, k(X,X,Y,Y), ranges between +1 and +3, with the smallest value of 1
occurring when the correlation is equal to zero, and the cokurtosis increases as the
correlation moves away from zero. Cokurtosis for the asymmetrical cases ranges from
-3 to +3 and is a linear relationship that is upward sloping as the correlation increases
from -1 to +1.


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 16.1 


1.D The calculations for the sample mean and sample variance are shown in the 
following table: 


X_i      Mean     Deviation    Squared Deviation
0.12     0.12     0.00         0.0000
0.25     0.12     0.13         0.0169
-0.01    0.12     -0.13        0.0169
0.36                           0.0338


The sum of all observations of returns for the asset is 0.36. Dividing this by the number of
observations, 3, results in an unbiased estimate of the mean of 0.12. The third column subtracts
the mean from the actual return for each year. The last column squares these deviations from the
mean. The sum of the squared deviations is equal to 0.0338 and dividing this by 2, for an unbiased
estimate (n - 1) instead of the number of observations, results in an estimated variance of
0.0169. The standard deviation is then 0.13 (computed as the square root of the variance).


(LO 16.a) 
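
The calculation in this answer can be checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 divisor. The variable names are illustrative.

```python
import statistics

# Annual returns from Question 1: 12%, 25%, and -1%.
returns = [0.12, 0.25, -0.01]

mean_return = statistics.mean(returns)
unbiased_variance = statistics.variance(returns)  # divides by n - 1
unbiased_stdev = statistics.stdev(returns)        # square root of the above

print(mean_return, unbiased_variance, unbiased_stdev)
```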


2.B The sample mean is an unbiased estimator of the population mean, because the
expected value of the sample mean is equal to the population mean. The best
linear unbiased estimator (BLUE) is the best estimator of the population mean
available because it has the minimum variance of any linear unbiased estimator.
(LO 16.e)


Module Quiz 16.2 


1.B The CLT requires that the mean and variance are finite. The CLT does not require 
assumptions about the distribution of the random variables of the population. (LO 
16.g) 
2.C A distribution that has a greater percentage of extremely large deviations from
the mean will be leptokurtic and will exhibit excess kurtosis (positive). The
distribution will have fatter tails than a normal distribution. (LO 16.h)

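3.B
$$\text{Corr}(R_A, R_B) = \frac{\text{Cov}(R_A, R_B)}{\sigma(R_A)\,\sigma(R_B)}$$

$$\sigma^2(R_A) = \left[\frac{\text{Cov}(R_A, R_B)}{\sigma(R_B)\,\text{Corr}(R_A, R_B)}\right]^2 = \left[\frac{0.0043}{(0.26)(0.5)}\right]^2 = 0.0331^2 = 0.0011$$

(LO 16.k)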
4.B Cov(R_A, R_B) = 0.4(-0.1 - 0.08)(0.5 - 0.17) + 0.3(0.1 - 0.08)(0.2 - 0.17)
+ 0.3(0.3 - 0.08)(-0.3 - 0.17) = -0.0546

(LO 16.k)
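
The covariance from the probability matrix can be verified with a short sketch. Only the three cells with nonzero probability are listed; the tuple layout and names are illustrative.

```python
# Nonzero cells of the probability matrix: (probability, R_A, R_B).
outcomes = [
    (0.40, -0.10, 0.50),
    (0.30, 0.10, 0.20),
    (0.30, 0.30, -0.30),
]

# Expected returns are probability-weighted averages.
e_a = sum(p * a for p, a, _ in outcomes)
e_b = sum(p * b for p, _, b in outcomes)

# Covariance is the probability-weighted average of the products of deviations.
cov_ab = sum(p * (a - e_a) * (b - e_b) for p, a, b in outcomes)

print(e_a, e_b, cov_ab)
```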
5.D A symmetrical curved graph with the minimum cokurtosis of 1 when the
correlation is 0. The graph will be an upward sloping linear relationship for the
other two asymmetric cases of cokurtosis k(X,Y,Y,Y) and k(X,X,X,Y). (LO 16.l)


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
READING 17
HYPOTHESIS TESTING 


Study Session 5 


EXAM FOCUS 


This reading provides insight into how risk managers make portfolio decisions on the 
basis of statistical analysis of samples of investment returns or other random economic 
and financial variables. We first focus on hypothesis testing procedures used to conduct 
tests concerned with population means and population variances. Specific tests 
reviewed include the z-test and the t-test. For the exam, you should be able to construct 
and interpret a confidence interval and know when and how to apply each of the test 
statistics discussed when conducting hypothesis testing. 


MODULE 17.1: HYPOTHESIS TESTING BASICS 


LO 17.a: Construct an appropriate null hypothesis and alternative hypothesis and 
distinguish between the two. 


Hypothesis testing is the statistical assessment of a statement or idea regarding a 
population. For instance, a statement could be, “The mean return for the U.S. equity
market is greater than zero.” Given the relevant returns data, hypothesis testing
procedures can be employed to test the validity of this statement at a given significance 
level. 


A hypothesis is a statement about the value of a population parameter developed for 
the purpose of testing a theory or belief. Hypotheses are stated in terms of the 
population parameter to be tested, like the population mean, μ. For example, a
researcher may be interested in the mean daily return on stock options. Hence, the 
hypothesis may be that the mean daily return on a portfolio of stock options is positive. 


Hypothesis testing procedures, based on sample statistics and probability theory, are 
used to determine whether a hypothesis is a reasonable statement and should not be 
rejected or if it is an unreasonable statement and should be rejected. Any hypothesis 
test has six components: 


■ The null hypothesis, which specifies a value of the population parameter that is
assumed to be true.

■ The alternative hypothesis, which specifies the values of the test statistic over which
we should reject the null.

■ The test statistic, which is calculated from the sample data.

■ The size of the test (commonly referred to as the significance level), which specifies
the probability of rejecting the null hypothesis when it is true.

■ The critical value, which is the value that is compared to the value of the test statistic
to determine whether or not the null hypothesis should be rejected.

■ The decision rule, which is the rule for deciding whether or not to reject the null
hypothesis based on a comparison of the test statistic and the critical value.


PROFESSOR'S NOTE
Throughout this reading we use the more commonly used term significance
level rather than test size. However, on the exam, recognize that if you see test
size, it simply means significance level.


The Null Hypothesis and Alternative Hypothesis 

The null hypothesis, designated H₀, is the hypothesis the researcher wants to reject. It
is the hypothesis that is actually tested and is the basis for the selection of the test
statistics. The null is generally a simple statement about a population parameter.
Typical statements of the null hypothesis for the population mean include H₀: μ = μ₀,
H₀: μ ≤ μ₀, and H₀: μ ≥ μ₀, where μ is the population mean and μ₀ is the hypothesized
value of the population mean.


PROFESSOR'S NOTE
The null hypothesis always includes the equal to condition.


The alternative hypothesis, designated Hₐ, is what is concluded if there is sufficient
evidence to reject the null hypothesis. It is usually the alternative hypothesis the 
researcher is really trying to assess. Why? Because you can never really prove anything 
with statistics, when the null hypothesis is discredited, the implication is that the 
alternative hypothesis is valid. 


The Choice of the Null and Alternative Hypotheses 


The most common null hypothesis will be an equal to hypothesis. The alternative is 
often the hoped-for hypothesis. When the null is that a coefficient is equal to zero, we 
hope to reject it and show the significance of the relationship. 


When the null is less than or equal to, the (mutually exclusive) alternative is framed as 
greater than. If we are trying to demonstrate that a return is greater than the risk-free 
rate, this would be the correct formulation. We will have set up the null and alternative 
hypothesis so rejection of the null will lead to acceptance of the alternative, our goal in 
performing the test. 


Hypothesis testing involves two statistics: the test statistic calculated from the sample 
data and the critical value of the test statistic. The value of the computed test statistic 
relative to the critical value is a key step in assessing the validity of a hypothesis. 


A test statistic is calculated by comparing the point estimate of the population 
parameter with the hypothesized value of the parameter (i.e., the value specified in the
null hypothesis). With reference to our option return example, this means we are 
concerned with the difference between the mean return of the sample and the 
hypothesized mean return. As indicated in the following expression, the test statistic is 
the difference between the sample statistic and the hypothesized value, scaled by the 
standard error of the sample statistic. 

$$\text{test statistic} = \frac{\text{sample statistic} - \text{hypothesized value}}{\text{standard error of the sample statistic}}$$

The standard error of the sample statistic is the adjusted standard deviation of the
sample. When the sample statistic is the sample mean, x̄, the standard error of the
sample statistic for sample size n is calculated as:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

when the population standard deviation, σ, is known, or

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$

when the population standard deviation, σ, is not known. In this case, it is estimated
using the standard deviation of the sample, s.


PROFESSOR'S NOTE
Don’t be confused by the notation here. A lot of the literature you will
encounter in your studies simply uses the term $\sigma_{\bar{x}}$ for the standard error of
the test statistic, regardless of whether the population standard deviation or
sample standard deviation was used in its computation.


One-Tailed and Two-Tailed Tests of Hypotheses 


LO 17.b: Differentiate between a one-sided and a two-sided test and identify 
when to use each test. 


The alternative hypothesis can be one-sided or two-sided. A one-sided test is referred 
to as a one-tailed test, and a two-sided test is referred to as a two-tailed test. Whether 
the test is one- or two-sided depends on the proposition being tested. If a researcher 
wants to test whether the return on stock options is greater than zero, a one-tailed test 
should be used. However, a two-tailed test should be used if the research question is 
whether the return on options is simply different from zero. Two-sided tests allow for 
deviation on both sides of the hypothesized value (zero). In practice, most hypothesis 
tests are constructed as two-tailed tests. 


A two-tailed test for the population mean may be structured as: 
H₀: μ = μ₀ versus Hₐ: μ ≠ μ₀


Because the alternative hypothesis allows for values above and below the hypothesized 
parameter, a two-tailed test uses two critical values (or rejection points). 


The general decision rule for a two-tailed test is: 
Reject H₀ if test statistic > upper critical value or
test statistic < lower critical value

Let’s look at the development of the decision rule for a two-tailed test using a
z-distributed test statistic (a z-test) at a 5% level of significance, α = 0.05.

■ At α = 0.05, the computed test statistic is compared with the critical z-values of
±1.96. The values of ±1.96 correspond to $\pm z_{\alpha/2} = \pm z_{0.025}$, which is the range of z-values
within which 95% of the probability lies. These values are obtained from the
cumulative probability table for the standard normal distribution (z-table), which is
included at the back of this book.

■ If the computed test statistic falls outside the range of critical z-values (i.e., test
statistic > 1.96, or test statistic < -1.96), we reject the null and conclude that the
sample statistic is sufficiently different from the hypothesized value.

■ If the computed test statistic falls within the range ±1.96, we conclude that the
sample statistic is not sufficiently different from the hypothesized value (μ = μ₀ in
this case), and we fail to reject the null hypothesis.


The decision rule (rejection rule) for a two-tailed z-test at α = 0.05 can be stated as:
Reject H₀ if test statistic < -1.96 or
test statistic > 1.96

Figure 17.1 shows the standard normal distribution for a two-tailed hypothesis test
using the z-distribution. Notice that the significance level of 0.05 means that there is
0.05 / 2 = 0.025 probability (area) under each tail of the distribution beyond ±1.96.
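
The two-tailed decision rule can be sketched in Python using the standard library's `statistics.NormalDist` in place of a printed z-table. The function names below are illustrative and not part of the curriculum.

```python
from statistics import NormalDist

def two_tailed_critical_values(alpha):
    """Return (lower, upper) critical z-values for significance level alpha."""
    upper = NormalDist().inv_cdf(1 - alpha / 2)  # leaves alpha / 2 in the upper tail
    return -upper, upper

def reject_null_two_tailed(test_statistic, alpha=0.05):
    """Reject the null if the statistic falls beyond either critical value."""
    lower, upper = two_tailed_critical_values(alpha)
    return test_statistic < lower or test_statistic > upper

lower, upper = two_tailed_critical_values(0.05)
print(round(lower, 2), round(upper, 2))
print(reject_null_two_tailed(2.3), reject_null_two_tailed(-1.5))
```

Splitting the significance level in half before the inverse lookup is what places 2.5% of the probability in each tail.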


Figure 17.1: Two-Tailed Hypothesis Test 
EXAMPLE: Two-tailed test


A researcher has gathered data on the daily returns on a portfolio of call options 
over a recent 250-day period. The mean daily return has been 0.1%, and the sample 
standard deviation of daily portfolio returns is 0.25%. The researcher believes the 
mean daily portfolio return is not equal to zero. Construct a hypothesis test of the 
researcher’s belief. 


Answer: 


First, we need to specify the null and alternative hypotheses. The null hypothesis is 
the one the researcher expects to reject. 


H₀: μ = 0 versus Hₐ: μ ≠ 0

Because the null hypothesis is an equality, this is a two-tailed test. At a 5% level of
significance, the critical z-values for a two-tailed test are ±1.96, so the decision rule
can be stated as:


Reject H₀ if: test statistic < -1.96 or test statistic > +1.96


The standard error of the sample mean is the adjusted standard deviation of the
sample. When the sample statistic is the sample mean, x̄, the standard error of the
sample statistic for sample size n is calculated as:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$

Because our sample statistic here is a sample mean, the standard error of the sample
mean for a sample size of 250 is 0.0025 / √250, and our test statistic is:

$$\frac{0.001}{0.0025 / \sqrt{250}} = \frac{0.001}{0.000158} = 6.33$$


Because 6.33 > 1.96, we reject the null hypothesis that the mean daily option return 
is equal to zero. Note that when we reject the null, we conclude that the sample 
value is significantly different from the hypothesized value. We are saying that the 
two values are different from one another after considering the variation in the 
sample. That is, the mean daily return of 0.001 is statistically different from zero 
given the sample’s standard deviation and size. 
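
The example's arithmetic can be reproduced with the sketch below (variable names are illustrative). Without intermediate rounding the statistic is about 6.32; the 6.33 in the example comes from rounding the standard error to 0.000158 first. The conclusion is the same either way.

```python
import math

sample_mean = 0.001       # 0.1% mean daily return
sample_sd = 0.0025        # 0.25% sample standard deviation
n = 250                   # number of daily observations
hypothesized_mean = 0.0   # value of the mean under the null

standard_error = sample_sd / math.sqrt(n)
z_stat = (sample_mean - hypothesized_mean) / standard_error
reject = abs(z_stat) > 1.96   # two-tailed rule at the 5% level

print(round(standard_error, 6), round(z_stat, 2), reject)
```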


For a one-tailed hypothesis test of the population mean, the null and alternative 
hypotheses are either: 
■ Upper tail: H₀: μ ≤ μ₀ versus Hₐ: μ > μ₀, or

■ Lower tail: H₀: μ ≥ μ₀ versus Hₐ: μ < μ₀.


The appropriate set of hypotheses depends on whether we believe the population mean,
μ, to be greater than (upper tail) or less than (lower tail) the hypothesized value, μ₀.
Using a z-test at the 5% level of significance, the computed test statistic is compared
with the critical values of 1.645 for the upper tail tests (i.e., Hₐ: μ > μ₀) or -1.645 for
lower tail tests (i.e., Hₐ: μ < μ₀). These critical values are obtained from a z-table, where
$-z_{0.05} = -1.645$ corresponds to a cumulative probability equal to 5%, and $z_{0.05} = 1.645$
corresponds to a cumulative probability of 95% (1 - 0.05).


Let’s use the upper tail test structure where H₀: μ ≤ μ₀ and Hₐ: μ > μ₀.


■ If the calculated test statistic is greater than 1.645, we conclude that the sample
statistic is sufficiently greater than the hypothesized value. In other words, we reject
the null hypothesis.

■ If the calculated test statistic is less than 1.645, we conclude that the sample statistic
is not sufficiently different from the hypothesized value, and we fail to reject the null
hypothesis.


Figure 17.2 shows the standard normal distribution and the rejection region for a one- 
tailed test (upper tail) at the 5% level of significance. 


Figure 17.2: One-Tailed Hypothesis Test 
EXAMPLE: One-tailed test


Perform a z-test using the option portfolio data from the previous example to test 
the belief that option returns are positive. 


Answer: 
In this case, we use a one-tailed test with the following structure: 
H₀: μ ≤ 0 versus Hₐ: μ > 0


The appropriate decision rule for this one-tailed z-test at a significance level of 5%
is:


Reject H₀ if: test statistic > 1.645


The test statistic is computed the same way, regardless of whether we are using a 
one-tailed or two-tailed test. From the previous example, we know the test statistic 
for the option return sample is 6.33. Because 6.33 > 1.645, we reject the null 
hypothesis and conclude that mean returns are statistically greater than zero at a 
5% level of significance. 
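
A one-tailed version of the rule places the entire significance level in a single tail. The sketch below is illustrative; the `tail` argument and function names are assumptions, not curriculum notation.

```python
from statistics import NormalDist

def one_tailed_critical_value(alpha, tail="upper"):
    """Critical z-value with all of alpha in one tail."""
    z = NormalDist().inv_cdf(1 - alpha)   # cumulative probability of 1 - alpha
    return z if tail == "upper" else -z

def reject_null_one_tailed(test_statistic, alpha=0.05, tail="upper"):
    """Upper tail: reject if statistic > critical; lower tail: reject if statistic < critical."""
    critical = one_tailed_critical_value(alpha, tail)
    if tail == "upper":
        return test_statistic > critical
    return test_statistic < critical

print(round(one_tailed_critical_value(0.05), 3))
print(reject_null_one_tailed(6.33))   # option portfolio test statistic
```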


Type I and Type II Errors


LO 17.c: Explain the difference between Type I and Type II errors and how these 
relate to the size and power of a test. 


Keep in mind that hypothesis testing is used to make inferences about the parameters 
of a given population on the basis of statistics computed for a sample that is drawn 
from that population. We must be aware that there is some probability that the sample, 
in some way, does not represent the population and any conclusion based on the sample 
about the population may be made in error. 


When drawing inferences from a hypothesis test, there are two types of errors: 
■ Type I error: the rejection of the null hypothesis when it is actually true.
■ Type II error: the failure to reject the null hypothesis when it is actually false.


The significance level is the probability of making a Type I error (rejecting the null 
when it is true) and is designated by the Greek letter alpha (α). For instance, a
significance level of 5% (α = 0.05) means there is a 5% chance of rejecting a true null
hypothesis. When conducting hypothesis tests, a significance level must be specified in 
order to identify the critical values needed to evaluate the test statistic. 
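
To see what a 5% significance level means in repeated sampling, the following simulation sketch draws many samples for which the null hypothesis is true and counts how often a two-tailed z-test rejects it. The sample size, number of trials, and seed are arbitrary choices for illustration.

```python
import math
import random

random.seed(42)            # fixed seed so the run is repeatable
n, trials = 30, 4000       # hypothetical sample size and number of repetitions
critical = 1.96            # two-tailed critical z-value at the 5% level

rejections = 0
for _ in range(trials):
    # Draw from N(0, 1), so the null hypothesis (mean = 0) is true by construction.
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    sample_mean = sum(sample) / n
    z_stat = sample_mean / (1.0 / math.sqrt(n))   # population sigma = 1 is known
    if abs(z_stat) > critical:
        rejections += 1    # a Type I error: rejecting a true null

type_i_rate = rejections / trials
print(type_i_rate)
```

The observed rejection rate should land near 0.05, the significance level, with some simulation noise.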


The decision for a hypothesis test is to either reject the null hypothesis or fail to reject 
the null hypothesis. Note that it is statistically incorrect to say “accept” the null 
hypothesis; it can only be supported or rejected. The decision rule for rejecting or 
failing to reject the null hypothesis is based on the distribution of the test statistic. For 
example, if the test statistic follows a normal distribution, the decision rule is based on 
critical values determined from the standard normal distribution (z-distribution). 
Regardless of the appropriate distribution, it must be determined if a one-tailed or two- 
tailed hypothesis test is appropriate before a decision rule (rejection rule) can be 
determined. 


A decision rule is specific and quantitative. Once we have determined whether a one- or 
two-tailed test is appropriate, the significance level we require, and the distribution of 
the test statistic, we can calculate the exact critical value for the test statistic. Then we 
have a decision rule of the following form: if the test statistic is (greater, less than) the 
value X, reject the null. 


While the significance level of a test is the probability of rejecting the null hypothesis 
when it is true, the power of a test is the probability of correctly rejecting the null 
hypothesis when it is false. The power of a test is actually one minus the probability of 
making a Type II error, or 1 - P(Type II error). In other words, the probability of 
rejecting the null when it is false (power of the test) equals one minus the probability of 
not rejecting the null when it is false (Type II error). When more than one test statistic 
may be used, the power of the test for the competing test statistics may be useful in 
deciding which test statistic to use. Ordinarily, we wish to use the test statistic that 
provides the most powerful test among all possible tests. 


Figure 17.3 shows the relationship between the level of significance, the power of a test, 
and the two types of errors. 


Figure 17.3: Type I and Type II Errors in Hypothesis Testing 
Sample size and the choice of significance level (Type I error probability) will together
determine the probability of a Type II error. The relation is not simple, however, and 
calculating the probability of a Type II error in practice is quite difficult. Decreasing 
the significance level (probability of a Type I error) from 5% to 1%, for example, will 
increase the probability of failing to reject a false null (Type II error) and, therefore, 
reduce the power of the test. Conversely, for a given sample size, we can increase the 
power of a test only with the cost that the probability of rejecting a true null (Type I 
error) increases. For a given significance level, we can decrease the probability of a 
Type II error and increase the power of a test, only by increasing the sample size. 
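
For the special case of an upper-tail z-test with a known population standard deviation, the power has a closed form, which makes these trade-offs easy to see. The sketch below is an illustration under that assumption; the inputs (a true mean 0.5 above the null value and σ = 2) are hypothetical.

```python
import math
from statistics import NormalDist

def power_upper_tail_z(true_shift, sigma, n, alpha):
    """Power of an upper-tail z-test when the true mean exceeds the
    hypothesized mean by true_shift (population sigma assumed known)."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)
    shift_in_se = true_shift / (sigma / math.sqrt(n))   # true shift in standard errors
    # Probability that the test statistic exceeds the critical value.
    return 1 - std_normal.cdf(critical - shift_in_se)

for alpha in (0.05, 0.01):
    for n in (25, 100):
        print(alpha, n, round(power_upper_tail_z(0.5, 2.0, n, alpha), 3))
```

Lowering the significance level from 5% to 1% reduces the power at each sample size, while increasing the sample size raises the power at each significance level.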


The Relation Between Confidence Intervals and 
Hypothesis Tests 


LO 17.d: Understand how a hypothesis test and a confidence interval are related. 
A confidence interval is a range of values within which the researcher believes the 
true population parameter may lie. 


A confidence interval is determined as:

$$\left[\text{sample statistic} - (\text{critical value})(\text{standard error})\right] \le \text{population parameter} \le \left[\text{sample statistic} + (\text{critical value})(\text{standard error})\right]$$


The interpretation of a confidence interval is that for a level of confidence of 95%, for 
example, there is a 95% probability that the true population parameter is contained in 
the interval. 


From the previous expression, we see that a confidence interval and a hypothesis test 
are linked by the critical value. For example, a 95% confidence interval uses a critical 
value associated with a given distribution at the 5% level of significance. Similarly, a 
hypothesis test would compare a test statistic to a critical value at the 5% level of 
significance. To see this relationship more clearly, the expression for the confidence 
interval can be manipulated and restated as: 


-critical value < test statistic < +critical value 


This is the range within which we fail to reject the null for a two-tailed hypothesis test 
at a given level of significance. 


EXAMPLE: Confidence interval 


Using option portfolio data from the previous examples, construct a 95% confidence 
interval for the population mean daily return over the 250-day sample period. Use a 
z-distribution. Decide if the hypothesis μ = 0 should be rejected.


Answer: 


Given a sample size of 250 with a standard deviation of 0.25%, the standard error 
can be computed as: 


$$s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{0.25\%}{\sqrt{250}} = 0.0158\%$$


At the 5% level of significance, the critical z-values for the confidence interval are
$z_{0.025} = 1.96$ and $-z_{0.025} = -1.96$. Thus, given a sample mean equal to 0.1%, the 95%
confidence interval for the population mean is:
0.1 - 1.96(0.0158) ≤ μ ≤ 0.1 + 1.96(0.0158), or
0.0690% ≤ μ ≤ 0.1310%


Because there is a 95% probability that the true mean is within this confidence
interval, we can reject the hypothesis μ = 0 because 0 is not within the confidence
interval.


Notice the similarity of this analysis with our test of whether μ = 0. We rejected the
hypothesis μ = 0 because the sample mean of 0.1% is more than 1.96 standard errors
from zero. Based on the 95% confidence interval, we reject μ = 0 because zero is
more than 1.96 standard errors from the sample mean of 0.1%.
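
The link between the interval and the test can be checked numerically. In the sketch below (illustrative names, values in percent), the decision from the confidence interval is compared with the decision from the z-statistic.

```python
import math

sample_mean = 0.1    # percent
sample_sd = 0.25     # percent
n = 250
z_critical = 1.96    # 95% confidence, two-tailed
hypothesized = 0.0   # value of the mean under the null

standard_error = sample_sd / math.sqrt(n)
lower = sample_mean - z_critical * standard_error
upper = sample_mean + z_critical * standard_error

# Decision 1: reject if the hypothesized value lies outside the interval.
reject_by_interval = not (lower <= hypothesized <= upper)
# Decision 2: reject if the test statistic lies beyond the critical values.
reject_by_test = abs((sample_mean - hypothesized) / standard_error) > z_critical

print(round(lower, 4), round(upper, 4), reject_by_interval, reject_by_test)
```

Both decisions always agree for a two-tailed test, because each asks whether the sample mean and the hypothesized value are more than 1.96 standard errors apart.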


Statistical Significance vs. Practical Significance 


Statistical significance does not necessarily imply practical significance. For example, 
we may have tested a null hypothesis that a strategy of going long all the stocks that 
satisfy some criteria and shorting all the stocks that do not satisfy the criteria resulted 
in returns that were less than or equal to zero over a 20-year period. Assume we have 
rejected the null in favor of the alternative hypothesis that the returns to the strategy 
are greater than zero (positive). This does not necessarily mean that investing in that 
strategy will result in economically meaningful positive returns. Several factors must 
be considered. 


One important consideration is transactions costs. Once we consider the costs of 
buying and selling the securities, we may find that the mean positive returns to the 
strategy are not enough to generate positive returns. Taxes are another factor that may 
make a seemingly attractive strategy a poor one in practice. A third reason that 
statistically significant results may not be economically significant is risk. In the 
strategy just discussed, we have additional risk from short sales (they may have to be 
closed out earlier than in the test strategy). Because the statistically significant results
were for a period of 20 years, it may be the case that there is significant variation from
year to year in the returns from the strategy, even though the mean strategy return is
greater than zero. This variation in returns from period to period is an additional risk 
to the strategy that is not accounted for in our test of statistical significance. 


Any of these factors could make committing funds to a strategy unattractive, even 
though the statistical evidence of positive returns is highly significant. By the nature of
statistical tests, a very large sample size can result in highly (statistically) significant
results that are quite small in absolute terms.
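
The sample-size effect can be illustrated with a hypothetical strategy whose mean daily return is only 0.01% with a 1% standard deviation. These figures are invented for illustration and do not come from the reading.

```python
import math

def z_statistic(sample_mean, sample_sd, n, hypothesized_mean=0.0):
    """Test statistic for a sample mean: difference scaled by the standard error."""
    return (sample_mean - hypothesized_mean) / (sample_sd / math.sqrt(n))

tiny_mean, sd = 0.0001, 0.01   # hypothetical: 0.01% mean, 1% standard deviation
for n in (250, 5_000, 100_000):
    print(n, round(z_statistic(tiny_mean, sd, n), 2))
```

The same tiny mean return is statistically insignificant with 250 observations but clears the one-tailed 5% critical value of 1.645 with 100,000 observations, even though its economic size has not changed.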


MODULE QUIZ 17.1


1. Austin Roberts believes the mean price of houses in the area is greater than
$145,000. A random sample of 36 houses in the area has a mean price of $149,750.
The population standard deviation is $24,000, and Roberts wants to conduct a
hypothesis test at a 1% level of significance. The appropriate alternative hypothesis
is:
A. Hₐ: μ < $145,000.
B. Hₐ: μ ≠ $145,000.
C. Hₐ: μ ≥ $145,000.
D. Hₐ: μ > $145,000.


2. Which of the following statements about hypothesis testing is most accurate? 
A. The power of a test is one minus the probability of a Type I error. 
B. The probability of a Type I error is equal to the significance level of the test. 
C. To test the claim that X is greater than zero, the null hypothesis would be H₀: X > 0.


D. If you can disprove the null hypothesis, then you have proven the alternative 
hypothesis. 


MODULE 17.2: HYPOTHESIS TESTING RESULTS 


LO 17.e: Explain what the p-value of a hypothesis test measures. 


The p-Value 


The p-value is the probability, assuming the null hypothesis is true, of obtaining a test
statistic at least as extreme as the one actually computed. It is the smallest level of
significance for which the null hypothesis can be rejected. For one-tailed tests, the
p-value is the probability that lies above the computed test statistic for upper tail tests or
below the computed test statistic for lower tail tests. For two-tailed tests, the p-value is 
the probability that lies above the positive value of the computed test statistic plus the 
probability that lies below the negative value of the computed test statistic. 


Consider a two-tailed hypothesis test about the mean value of a random variable at the
5% significance level where the test statistic is 2.3, greater than the upper critical
value of 1.96. If we consult the z-table, we find the probability of getting a value greater
than 2.3 is (1 - 0.9893) = 1.07%. Because it’s a two-tailed test, our p-value is 2 × 1.07% =
2.14%, as illustrated in Figure 17.4. At a 3%, 4%, or 5% significance level, we would
reject the null hypothesis, but at a 2% or 1% significance level, we would not. Many
researchers report p-values without selecting a significance level and allow the reader
to judge how strong the evidence for rejection is.
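
The p-value calculation for a z-distributed statistic can be sketched as follows. The function name and the `tail` argument are illustrative.

```python
from statistics import NormalDist

def p_value_z(test_statistic, tail="two"):
    """p-value for a z-distributed test statistic; tail is "two", "upper", or "lower"."""
    std_normal = NormalDist()
    if tail == "upper":
        return 1 - std_normal.cdf(test_statistic)   # area above the statistic
    if tail == "lower":
        return std_normal.cdf(test_statistic)       # area below the statistic
    # Two-tailed: area beyond +|z| plus area beyond -|z|.
    return 2 * (1 - std_normal.cdf(abs(test_statistic)))

p = p_value_z(2.3)
print(round(p, 4))
for alpha in (0.05, 0.03, 0.02, 0.01):
    print(alpha, p < alpha)   # reject when the p-value is below the significance level
```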


Figure 17.4: Two-Tailed Hypothesis Test With p-Value = 2.14% 
Confidence Intervals for Hypothesis Tests


LO 17.f: Construct and apply confidence intervals for one-sided and two-sided 
hypothesis tests and interpret the results of hypothesis tests with a specific 
confidence level. 


As mentioned earlier, confidence interval estimates result in a range of values within
which the actual value of a parameter will lie, given the probability of 1 - α, where
alpha, α, is the level of significance for the confidence interval, and the probability 1 - α
is the degree of confidence. Recall the confidence interval for a two-tailed test:


$$\left[\text{sample statistic} - (\text{critical value})(\text{standard error})\right] \le \text{population parameter} \le \left[\text{sample statistic} + (\text{critical value})(\text{standard error})\right]$$


Confidence intervals can also be provided for one-tailed tests as either:


Upper tail:

$$\left[\text{sample statistic} - (\text{critical value})(\text{standard error})\right] \le \text{population parameter}$$


Lower tail:

$$\text{population parameter} \le \left[\text{sample statistic} + (\text{critical value})(\text{standard error})\right]$$

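
A one-sided bound can be sketched as below, assuming a z-distributed statistic and assuming that an upper-tail test corresponds to a lower bound on the parameter (and a lower-tail test to an upper bound). The function name and `tail` argument are illustrative.

```python
import math
from statistics import NormalDist

def one_sided_bound(sample_stat, standard_error, alpha, tail):
    """Bound on the population parameter implied by a one-tailed z-test.
    tail="upper" returns a lower bound; tail="lower" returns an upper bound."""
    critical = NormalDist().inv_cdf(1 - alpha)   # all of alpha in one tail
    if tail == "upper":
        return sample_stat - critical * standard_error
    return sample_stat + critical * standard_error

se = 0.25 / math.sqrt(250)   # option portfolio data, in percent
lower_bound = one_sided_bound(0.1, se, 0.05, "upper")
upper_bound = one_sided_bound(0.1, se, 0.05, "lower")
print(round(lower_bound, 4), round(upper_bound, 4))
```

For the option portfolio data, the lower bound is above zero, which agrees with rejecting the null of a mean return less than or equal to zero in the upper-tail test.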

With hypothesis testing, the choice between using a critical value based on the t- 
distribution or the z-distribution depends on the sample size, the distribution of the 
population, and whether the variance of the population is known or unknown. 


The t-Test 


The t-test is a widely used hypothesis test that employs a test statistic that is 
distributed according to a t-distribution. Following are the rules for when it is 
appropriate to use the t-test for hypothesis tests of the population mean. 


Use the t-test if the population variance is unknown and either of the following
conditions exists:


■ The sample is large (n ≥ 30).


■ The sample is small (n < 30), but the distribution of the population is normal or
approximately normal.


If the sample is small and the distribution is nonnormal, we have no reliable statistical 
test. 


The computed value for the test statistic based on the t-distribution is referred to as 
the t-statistic. For hypothesis tests of a population mean, a t-statistic with n - 1 degrees 
of freedom is computed as: 


$$t_{n-1} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$


where:

x̄ = sample mean
μ₀ = hypothesized population mean (i.e., the null)
s = standard deviation of the sample
n = sample size


PROFESSOR'S NOTE
This computation is not new. It is the same test statistic computation that we
have been performing all along. Note the use of the sample standard deviation,
s, in the standard error term in the denominator.


To conduct a t-test, the t-statistic is compared to a critical t-value at the desired level 
of significance with the appropriate degrees of freedom. 
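
A minimal t-test sketch follows, using a hypothetical sample of 10 monthly returns (in percent) invented for illustration. Python's standard library has no t-distribution, so the two-tailed 5% critical value for 9 degrees of freedom (2.262) is entered by hand from a standard t-table.

```python
import math
import statistics

# Hypothetical sample of 10 monthly returns, in percent.
sample = [1.2, -0.4, 0.8, 2.1, 0.3, -1.0, 1.5, 0.9, 0.2, 1.4]
hypothesized_mean = 0.0

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)                     # sample standard deviation (n - 1 divisor)
t_stat = (x_bar - hypothesized_mean) / (s / math.sqrt(n))

t_critical = 2.262                               # t-table value: df = 9, two-tailed, 5%
reject = abs(t_stat) > t_critical
print(round(x_bar, 2), round(t_stat, 2), reject)
```

With a sample this small, the result is only reliable if the population is normal or approximately normal.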


In the real world, the underlying variance of the population is rarely known, so the t- 
test enjoys widespread application. 


The z-Test 


The z-test is the appropriate hypothesis test of the population mean when the 
population is normally distributed with known variance. The computed test statistic 
used with the z-test is referred to as the z-statistic. The z-statistic for a hypothesis test 
for a population mean is computed as follows: 


$$z\text{-statistic} = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$


where:

x̄ = sample mean
μ₀ = hypothesized population mean
σ = standard deviation of the population
n = sample size


To test a hypothesis, the z-statistic is compared to the critical z-value corresponding to 
the significance of the test. Critical z-values for the most common levels of significance 
are displayed in Figure 17.5. You should memorize these critical values for the exam. 


Figure 17.5: Critical z-Values 


Level of Significance    Two-Tailed Test    One-Tailed Test
0.10 = 10%               ±1.65              +1.28 or -1.28
0.05 = 5%                ±1.96              +1.65 or -1.65
0.01 = 1%                ±2.58              +2.33 or -2.33


When the sample size is large and the population variance is unknown, the z-statistic is: 


$$z\text{-statistic} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$


where:

x̄ = sample mean
μ₀ = hypothesized population mean
s = standard deviation of the sample
n = sample size


Note the use of the sample standard deviation, s, versus the population standard
deviation, σ. Remember, this is acceptable if the sample size is large, although the
t-statistic is the more conservative measure when the population variance is unknown.


EXAMPLE: z-test or t-test? 


Referring to our previous option portfolio mean return problem once more, 
determine which test statistic (z or t) should be used and the difference in the 
likelihood of rejecting a true null with each distribution. 


Answer: 


The population variance for our sample of returns is unknown. Hence, the t- 
distribution is appropriate. With 250 observations, however, the sample is 
considered to be large, so the z-distribution would also be acceptable. This is a trick 
question—either distribution, t or z, is appropriate. With regard to the difference in 
the likelihood of rejecting a true null, because our sample is so large, the critical 
values for the t and z are almost identical. Hence, there is almost no difference in the 
likelihood of rejecting a true null. 
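
The selection rules can be collected in a small helper. This is a sketch of the rules as stated in this module; the function name and return labels are illustrative, and treating a large sample with known variance from a nonnormal population as a z-test case is an assumption based on the usual large-sample argument.

```python
def choose_test(variance_known, n, population_normal):
    """Return "z", "t", "t or z", or "none" for a test of the population mean."""
    large = n >= 30
    if variance_known:
        # Known variance: z-test for a normal population or a large sample.
        return "z" if (population_normal or large) else "none"
    if large:
        return "t or z"   # unknown variance, large sample: t is more conservative
    # Unknown variance, small sample: t-test only if the population is normal.
    return "t" if population_normal else "none"

print(choose_test(variance_known=False, n=250, population_normal=False))
print(choose_test(variance_known=False, n=10, population_normal=False))
```

The first call matches the option portfolio case, where either distribution is acceptable.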


EXAMPLE: The z-test 


When your company’s gizmo machine is working properly, the mean length of 
gizmos is 2.5 inches. However, from time to time the machine gets out of alignment 
and produces gizmos that are either too long or too short. When this happens, 
production is stopped and the machine is adjusted. To check the machine, the quality 
control department takes a gizmo sample each day. Today, a random sample of 49 
gizmos showed a mean length of 2.49 inches. The population standard deviation is 
known to be 0.021 inches. Using a 5% significance level, determine if the machine 
should be shut down and adjusted. 


Answer: 


Let μ be the mean length of all gizmos made by this machine, and let x̄ be the
corresponding mean for the sample. A common hypothesis testing procedure is
outlined as follows: 


Statement of hypothesis. For the information provided, the null and alternative 
hypotheses are appropriately structured as: 


H₀: μ = 2.5 (The machine does not need an adjustment.)
Hₐ: μ ≠ 2.5 (The machine needs an adjustment.)


Note that because this is a two-tailed test, Hₐ allows for values above and below 2.5.


Select the appropriate test statistic. Because the population variance is known and 
the sample size is > 30, the z-statistic is the appropriate test statistic. The z-statistic 
is computed as: 


$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

Specify the level of significance. The level of significance is given at 5%, implying that 
we are willing to accept a 5% probability of rejecting a true null hypothesis. 


State the decision rule regarding the hypothesis. The ≠ sign in the alternative
hypothesis indicates that the test is two-tailed with two rejection regions, one in
each tail of the standard normal distribution curve. Because the total area of both
rejection regions combined is 0.05 (the significance level), the area of the rejection
region in each tail is 0.025. You should know that the critical z-values for $\pm z_{0.025}$ are
±1.96. This means that the null hypothesis should not be rejected if the computed
z-statistic lies between -1.96 and +1.96 and should be rejected if it lies outside of these
critical values. The decision rule can be stated as:


Reject H₀ if: z-statistic < $-z_{0.025}$ or z-statistic > $z_{0.025}$, or equivalently
Reject H₀ if: z-statistic < -1.96 or z-statistic > +1.96


Collect the sample and calculate the test statistic. The value of x̄ from the sample is
2.49. Because σ is given as 0.021, we calculate the z-statistic using σ as follows:

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{2.49 - 2.5}{0.021 / \sqrt{49}} = \frac{-0.01}{0.003} = -3.33$$



Make a decision regarding the hypothesis. The calculated value of the z-statistic is
-3.33. Because this value is less than the critical value, $-z_{0.025} = -1.96$, it falls in the
rejection region in the left tail of the z-distribution. Hence, there is sufficient
evidence to reject H₀.


Make a decision based on the results of the test. Based on the sample information and 
the results of the test, it is concluded that the machine is out of adjustment and 
should be shut down for repair. 
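
The procedure above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the function name `two_tailed_z_test` is an assumption introduced here, and the critical value is obtained from the standard library's `NormalDist` instead of a z-table.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, hypothesized_mean, sigma, n, alpha=0.05):
    """Return (z statistic, critical value, reject flag) for a two-tailed z-test.

    Assumes the population standard deviation (sigma) is known.
    """
    standard_error = sigma / sqrt(n)
    z = (sample_mean - hypothesized_mean) / standard_error
    # Each tail holds alpha / 2 of the probability in a two-tailed test.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) > z_crit


# Inputs from the machine-adjustment example: x-bar = 2.49, mu_0 = 2.5,
# sigma = 0.021, n = 49, significance level = 5%.
z_stat, z_crit, reject = two_tailed_z_test(2.49, 2.5, 0.021, 49)
print(f"z = {z_stat:.2f}, critical values = +/-{z_crit:.2f}, reject H0: {reject}")
```

With the example's inputs, the statistic is about -3.33, which lies below -1.96, so the sketch reaches the same decision as the text.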


Testing the Equality of Means 


LO 17.g: Identify the steps to test a hypothesis about the difference between two 
population means. 


In finance, we are often interested in testing whether the means of two populations are 
equal to each other. This is equivalent to testing whether the difference between the 
two means is zero. 


If we assume two series (X and Y) are each independent and identically distributed 
(i.i.d.) and have a covariance of Cov(X,Y), the appropriate test statistic is: 


$$T = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s_X^2 + s_Y^2 - 2\,\mathrm{Cov}(X,Y)}{n}}}$$

This test statistic has a standard normal distribution when the null hypothesis is true. 




The steps to test the hypothesis that the means are equal would then follow the 
standard hypothesis testing procedure. The null hypothesis would be that the difference 
between the two is equal to zero, versus the alternative that it is not equal to zero. 
Given the test size and the appropriate critical value, the null would be rejected or fail 
to be rejected by comparing the test statistic to the critical value. 
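
As a rough sketch of the test statistic above, the following Python computes T for two paired series. The data values and the function name `equality_of_means_stat` are hypothetical, and the use of n - 1 in the sample variances and covariance is an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, variance


def equality_of_means_stat(x, y):
    """T = (x_bar - y_bar) / sqrt((s_x^2 + s_y^2 - 2 Cov(x, y)) / n)."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of observations")
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    # Sample covariance with the same n - 1 divisor that statistics.variance uses.
    cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    return (x_bar - y_bar) / sqrt((variance(x) + variance(y) - 2 * cov_xy) / n)


# Hypothetical paired returns for two series observed over the same periods.
x = [0.012, -0.004, 0.021, 0.008, -0.010, 0.015]
y = [0.009, -0.001, 0.014, 0.011, -0.006, 0.007]
t_stat = equality_of_means_stat(x, y)
print(f"T = {t_stat:.3f}")
```

The resulting T would then be compared with the critical value for the chosen test size, exactly as in the standard procedure.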


Multiple Hypothesis Testing 


LO 17.h: Explain the problem of multiple testing and how it can lead to biased 
results. 


Multiple testing means testing multiple different hypotheses on the same data set. For
example, suppose we are testing 10 active trading strategies against a buy-and-hold
trading strategy. If we keep testing different strategies against the
same null hypothesis, it is highly likely that we will eventually reject one of them.
The problem is that the alpha (the probability of incorrectly rejecting a true
null) is only accurate for one single hypothesis test. As we test more and more
strategies, the actual alpha of this repeated testing grows larger, and as alpha grows
larger, the probability of a Type I error increases.
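
To see how quickly the effective alpha can grow, the sketch below computes the probability of at least one false rejection across several tests. It assumes the tests are independent, which is a simplifying assumption of this illustration and not something the text states; the function name is also hypothetical.

```python
def familywise_error_rate(alpha, num_tests):
    """Probability of at least one Type I error across independent tests."""
    # P(no false rejection in any test) = (1 - alpha) ** num_tests
    return 1 - (1 - alpha) ** num_tests


# Ten strategies, each tested at a 5% significance level.
rate = familywise_error_rate(0.05, 10)
print(f"Chance of at least one false rejection: {rate:.1%}")
```

Under that independence assumption, ten tests at 5% each give roughly a 40% chance of at least one false rejection, far above the nominal 5%.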


MODULE QUIZ 17.2
1. The most likely bias to result from testing multiple hypotheses on a single data set is
that the value of:
A. a Type I error will increase. 
B. a Type II error will increase. 
C. the critical value will increase. 
D. the test statistic will increase. 


2. Austin Roberts believes the mean price of houses in the area is greater than 
$145,000. A random sample of 36 houses in the area has a mean price of $149,750. 
The population standard deviation is $24,000, and Roberts wants to conduct a 
hypothesis test at a 1% level of significance. The value of the calculated test
statistic is closest to:


A. z= 0.67. 
B. z= 1.19. 
C. z = 4.00. 
D. z = 8.13. 


KEY CONCEPTS 


LO 17.a 


The hypothesis testing process requires a statement of a null and an alternative 
hypothesis, the selection of the appropriate test statistic, specification of the 
significance level, a decision rule, the calculation of a sample statistic, a decision 
regarding the hypotheses based on the test, and a decision based on the test results. 
The test statistic is the value that a decision about a hypothesis will be based on. For a 
test about the value of the mean of a distribution: 

$$\text{test statistic} = \frac{\text{sample mean} - \text{hypothesized mean}}{\text{standard error of sample mean}}$$


LO 17.b 

A two-tailed test results from a two-sided alternative hypothesis (e.g., $H_a$: $\mu \neq \mu_0$). A
one-tailed test results from a one-sided alternative hypothesis (e.g., $H_a$: $\mu > \mu_0$ or
$H_a$: $\mu < \mu_0$).


LO 17.c 


Decision             | True Condition: $H_0$ is true           | True Condition: $H_0$ is false
Do not reject $H_0$  | Correct decision                        | Incorrect decision: Type II error
Reject $H_0$         | Incorrect decision: Type I error        | Correct decision
                     | Significance level, $\alpha$,           | Power of the test
                     | = P(Type I error)                       | = 1 - P(Type II error)


LO 17.d 
Hypothesis testing compares a computed test statistic to a critical value at a stated 
level of significance, which is the decision rule for the test. 


A hypothesis about a population parameter is rejected when the sample statistic lies 
outside a confidence interval around the hypothesized value for the chosen level of 
significance. 


LO 17.e 


The p-value is the probability of obtaining a test statistic that would lead to a rejection
of the null hypothesis, assuming the null hypothesis is true. It is the smallest level of
significance for which the null hypothesis can be rejected.


LO 17.f 


For hypothesis tests of a population mean, a t-statistic with n - 1 degrees of freedom is 
computed as: 


$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$


To conduct a t-test, the t-statistic is compared to a critical t-value at the desired level 
of significance with the appropriate degrees of freedom. 


LO 17.g 


The appropriate test statistic to test whether the means of two populations are equal to
each other is:

$$T = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s_X^2 + s_Y^2 - 2\,\mathrm{Cov}(X,Y)}{n}}}$$


This test statistic has a standard normal distribution when the null hypothesis is true. 


LO 17.h 

Multiple testing means testing multiple different hypotheses on the same data set. The
problem with multiple testing is that the alpha is only accurate for one single 
hypothesis test. As we test more and more strategies, the alpha of this repeated testing 
grows larger, and as alpha grows larger, the probability of a Type I error increases. 


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 17.1 
1.D Hₐ: μ > $145,000


(LO 17.b) 


2.B The probability of getting a test statistic outside the critical value(s) when the 
null is true is the level of significance and is the probability of a Type I error. The 
power of a test is one minus the probability of a Type II error. Hypothesis testing 
does not prove a hypothesis; we either reject the null or fail to reject it. The 
appropriate null would be X < 0 with X > 0 as the alternative hypothesis. (LO 17.c) 


Module Quiz 17.2 


1.A With multiple testing, the alpha (the probability of incorrectly rejecting a true 
null) is only accurate for one single hypothesis test. As we test more and more 
strategies, the actual alpha of this repeated testing grows larger, and as alpha 
grows larger, the probability of a Type I error increases. (LO 17.h) 


2.B
$$z = \frac{149,750 - 145,000}{24,000/\sqrt{36}} = 1.1875$$
(LO 17.f)


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 7. 


READING 18 
LINEAR REGRESSION 


Study Session 6 


EXAM FOCUS 


Linear regression refers to the process of representing relationships with linear 
equations where there is one dependent variable being explained by one or more 
independent variables. Typically, we estimate a regression equation using ordinary 
least squares (OLS), which minimizes the sum of squared errors in the sample data. For 
the exam, be able to conduct hypothesis tests, calculate confidence intervals, and 
remember the assumptions underlying the regression model. Finally, understand how 
to interpret a regression equation. 


MODULE 18.1: REGRESSION ANALYSIS 


LO 18.a: Describe the models that can be estimated using linear regression and 
differentiate them from those which cannot. 


Regression analysis seeks to measure how changes in one variable, called the
dependent (or explained) variable, can be explained by changes in one or more other
variables, called the independent (or explanatory) variables. This relationship is
captured by estimating a linear equation.


As an example, we want to capture the relationship between hedge fund returns and 
lockup periods. 


For this simple two-variable case (i.e., one explained and one explanatory variable), the 
function is: 


$$E(\text{return}) = \alpha + \beta \times (\text{lockup period})$$

Or more generally:

$$E(Y) = \alpha + \beta \times X$$

Which we can write as:

$$Y = \alpha + \beta \times X + \varepsilon$$

where:
$\beta$ = regression or slope coefficient; sensitivity of Y to changes in X
$\alpha$ = value of Y when X = 0
$\varepsilon$ = random error or shock; unexplained (by X) component of Y


This error may be reduced by using more independent variables or by using different, 
more appropriate independent variables. 


Note that the interpretation of $\alpha$ changes when X cannot realistically take on a value of
0. In such a case, $\alpha$ is the value that ensures that the mean of Y lies on the fitted
regression line.


Linear Regression Conditions 
To use linear regression, three conditions need to be satisfied: 
1. The relationship between Y and X should be linear (discussed later). 


2. The error term must be additive (i.e., the variance of the error term is independent of 
the observed data). 


3. All X variables should be observable (i.e., the model is inappropriate when
data are missing).


The term linear has implications for both the independent variable(s) and the unknown
parameters (i.e., the coefficients). However, appropriate transformations of the
independent variable(s) can make a nonlinear relationship suitable for fitting with a
linear model.


If the relationship between the dependent variable (Y) and an independent variable (X) 
is nonlinear, then an analyst would do that transformation first and then enter the 
transformed value into the linear equation as X. For example, in estimating a utility 
function as a function of consumption, we might allow for the property of diminishing 
marginal utility by transforming consumption into a logarithm of consumption. In 
other words, the actual relationship is: 


$$E(\text{utility}) = \alpha + \beta \times \ln(\text{amount consumed})$$


Here we let Y = utility and X = ln(amount consumed) and then estimate $E(Y) = \alpha + \beta \times X$
using linear techniques.
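
The transform-then-fit idea can be sketched as follows. The data are hypothetical and were generated from utility = 1.0 + 2.0 × ln(consumption) with no noise, so the fit should recover those two numbers; `fit_ols` is an illustrative helper written for this sketch, not a library function.

```python
from math import log


def fit_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data generated from utility = 1.0 + 2.0 * ln(consumption).
consumption = [1.0, 2.0, 4.0, 8.0, 16.0]
utility = [1.0 + 2.0 * log(c) for c in consumption]

# Transform first, then fit a straight line in the transformed variable.
log_consumption = [log(c) for c in consumption]
alpha_hat, beta_hat = fit_ols(log_consumption, utility)
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
```

The relationship is nonlinear in consumption but linear in ln(consumption), which is why ordinary linear techniques apply after the transformation.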


A second interpretation of the term linear applies to the unknown coefficients. It
specifies that the dependent variable is a linear function of the coefficients. For
example, consider an unknown parameter, p, in a function: $Y = \alpha + \beta X^p + \varepsilon$. In this
instance, $\beta X^p$ contains two unknown parameters ($\beta$ and p), and p does not enter the
model multiplicatively. Hence, it would not be appropriate to apply linear
regression in such a case.


MODULE QUIZ 18.1


1. Generally, if the value of the independent variable is zero, then the expected value 
of the dependent variable would be equal to the: 


A. slope coefficient.
B. intercept coefficient.
C. error term. 
D. residual. 
2. The error term represents the portion of the: 
A. dependent variable that is not explained by the independent variable(s) but could 
possibly be explained by adding additional independent variables. 
B. dependent variable that is explained by the independent variable(s). 
C. independent variables that are explained by the dependent variable. 
D. dependent variable that is explained by the error in the independent variable(s). 
3. A linear regression function assumes that the relation being modeled must be linear 
in: 
A. both the variables and the coefficients. 
B. the coefficients but not necessarily the variables. 


C. the variables but not necessarily the coefficients. 
D. neither the variables nor the coefficients. 


MODULE 18.2: ORDINARY LEAST SQUARES 
ESTIMATION 


LO 18.b: Interpret the results of an ordinary least squares (OLS) regression with 
a single explanatory variable. 


Ordinary least squares (OLS) estimation is a process that estimates the parameters $\alpha$
and $\beta$ by minimizing the sum of the squared residuals (i.e., error terms).


Rewriting our regression equation: $\varepsilon_i = Y_i - (\alpha + \beta \times X_i)$; the OLS sample
coefficients are those that minimize: $\sum \varepsilon_i^2 = \sum [Y_i - (\alpha + \beta \times X_i)]^2$


The estimated slope coefficient ($\hat{\beta}$) for the regression line describes the change in Y
for a one-unit change in X. The slope term is calculated as:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}$$

The intercept term ($\hat{\alpha}$) is the line's intersection with the Y-axis at X = 0. A property of
the least squares method is that the intercept term may be expressed as:


$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$$

where:
$\bar{Y}$ = mean of Y
$\bar{X}$ = mean of X


The intercept equation highlights the fact that the regression line passes through a 
point with coordinates equal to the mean of the independent and dependent variables. 
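
A short Python sketch of these two formulas is given below, using made-up data. The function name `ols_coefficients` and the numbers are assumptions for illustration only; the final lines check the property just described, that the fitted line passes through the point of means.

```python
from statistics import mean


def ols_coefficients(x, y):
    """Estimate intercept and slope using Cov(X, Y) / Var(X)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n
    var_x = sum((a - x_bar) ** 2 for a in x) / n
    beta_hat = cov_xy / var_x
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
alpha_hat, beta_hat = ols_coefficients(x, y)

# The fitted value at the mean of X equals the mean of Y.
fitted_at_mean = alpha_hat + beta_hat * mean(x)
# OLS residuals sum to zero when an intercept is included.
residual_sum = sum(b - (alpha_hat + beta_hat * a) for a, b in zip(x, y))
print(f"alpha = {alpha_hat:.2f}, beta = {beta_hat:.2f}")
```

The same divisor is used for the covariance and the variance, so it cancels in the slope; using n - 1 instead of n would give the same coefficients.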


Interpreting Regression Results 


The intercept term, $\alpha$, is the value of the dependent variable when the independent
variable is equal to zero. The slope coefficient, $\beta$, is the estimated change in the
dependent variable for a one-unit change in that independent variable. In the case 
where the model uses multiple independent variables, the interpretation of the slope 
coefficient captures the change in the dependent variable for a one-unit change in the 
independent variable, holding the other independent variables constant. As you will see 
in the next reading, this is why the slope coefficients in a multiple regression are 
sometimes called partial slope coefficients. 


EXAMPLE: Regression model with one explanatory variable 


The mean annual return over the past 20 years for a specific stock ($\bar{Y}$) is 11%, while
that for the market ($\bar{X}$) is 8.4%. The covariance of annual returns for the stock and the
market ($\sigma_{XY}$) and the variance of the market ($\sigma_X^2$) are shown in the following
variance-covariance matrix:

$$\begin{pmatrix} \sigma_Y^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_X^2 \end{pmatrix} = \begin{pmatrix} 151.22 & 132.11 \\ 132.11 & 181.40 \end{pmatrix}$$

Calculate the estimated slope coefficient and intercept, and interpret the regression
results.


Answer: 


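$$\hat{\beta} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)} = \frac{\sigma_{XY}}{\sigma_X^2} = \frac{132.11}{181.40} = 0.73$$


$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X} = 0.11 - 0.73 \times 0.084 = 0.049$$

Interpretation of the coefficients: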


$\beta$: A 1% increase in the market return would lead to a 0.73% increase
in the return on the stock.


$\alpha$: If the market return is 0%, the stock's return would be 0.049, or 4.9%.
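
The arithmetic in this example can be reproduced directly from the summary statistics. In the sketch below, `ols_from_moments` is a hypothetical helper name; the inputs are the means, covariance, and market variance given in the example.

```python
def ols_from_moments(mean_y, mean_x, cov_xy, var_x):
    """Return (intercept, slope) from means, a covariance, and a variance."""
    beta = cov_xy / var_x
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Stock mean 11%, market mean 8.4%, Cov = 132.11, Var(market) = 181.40.
alpha_hat, beta_hat = ols_from_moments(0.11, 0.084, 132.11, 181.40)
print(f"beta = {beta_hat:.2f}, alpha = {alpha_hat:.3f}")
```

The sketch carries the unrounded slope into the intercept calculation, whereas the text uses the rounded 0.73; both round to an intercept of 0.049.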


Dummy Variables 


Observations for most independent variables (e.g., firm size, level of GDP, and interest 
rates) can take on a wide range of values. However, there are occasions when the 
independent variable is binary in nature—it is either on or off. Independent variables 
that fall into this category are called dummy variables and are often used to quantify 
the impact of qualitative variables. 


Dummy variables are assigned a value of 0 or 1. For example, in a time series regression 
of monthly stock returns, you could employ a January dummy variable that would take 
on the value of 1 if a stock return occurred in January, and 0 if it occurred in any other 
month. The purpose of including the January dummy variable would be to see if stock 
returns in January were significantly different than stock returns in all other months of 
the year. 
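
The January example can be illustrated with a small sketch. The month and return figures below are invented for illustration, and `fit_ols` is a helper defined here. With a single 0/1 regressor, the OLS intercept equals the mean return in non-January months and the slope equals the January mean minus the non-January mean.

```python
def fit_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical (month number, monthly return) observations.
observations = [(1, 0.031), (2, 0.004), (3, -0.012), (6, 0.009),
                (9, -0.003), (12, 0.010), (1, 0.027), (4, 0.005)]

# January dummy: 1 if the return occurred in January, 0 otherwise.
january = [1 if month == 1 else 0 for month, _ in observations]
returns = [r for _, r in observations]

intercept, slope = fit_ols(january, returns)

jan_returns = [r for d, r in zip(january, returns) if d == 1]
other_returns = [r for d, r in zip(january, returns) if d == 0]
jan_mean = sum(jan_returns) / len(jan_returns)
other_mean = sum(other_returns) / len(other_returns)
print(f"intercept = {intercept:.4f}, January effect = {slope:.4f}")
```

Testing whether January returns differ from other months then amounts to testing whether the dummy's slope coefficient is different from zero.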


Coefficient of Determination of a Regression ($R^2$)


LO 18.h: Estimate the correlation coefficient from the $R^2$ measure obtained in
linear regressions with a single explanatory variable. 


The $R^2$ of a regression model captures the fit of the model; it represents the proportion
of variation in the dependent variable that is explained by the independent variable(s).
For a regression model with a single independent variable, $R^2$ is the square of the
correlation between the independent and dependent variable.


$$R^2 = r_{X,Y}^2$$


where $r_{X,Y}$ = correlation between X and Y
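
The equality can be checked numerically. The sketch below uses hypothetical data, computes $R^2$ as one minus the ratio of unexplained to total variation, and compares it with the squared correlation; all variable names are illustrative.

```python
from math import sqrt

# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)

# OLS fit with a single explanatory variable.
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# R^2 = 1 - (unexplained variation / total variation).
rss = sum((b - (alpha_hat + beta_hat * a)) ** 2 for a, b in zip(x, y))
r_squared = 1 - rss / s_yy

# Sample correlation between X and Y.
correlation = s_xy / sqrt(s_xx * s_yy)
print(f"R^2 = {r_squared:.4f}, r^2 = {correlation ** 2:.4f}")
```

Going the other way, the correlation can be recovered as the square root of $R^2$, taking the sign of the estimated slope.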


Assumptions Underlying Linear Regression 


LO 18.c: Describe the key assumptions of OLS parameter estimation. 


OLS regression requires a number of assumptions. Most of the major assumptions 
pertain to the regression model’s residual term (i.e. the error term). The key 
assumptions are as follows: 


■ The expected value of the error term, conditional on the independent variable, is zero
[$E(\varepsilon_i | X_i) = 0$]. This means that X has no information about the location of $\varepsilon$. This
assumption is not directly testable because OLS estimation on sample data always produces
residuals that are uncorrelated with the Xs. Evaluating whether this assumption is
reasonable requires an examination of the data generating process. Generally, a
violation would be evidenced by the following:


- Survivorship, or sample selection, bias: Survivorship bias occurs when the
observations are collected after the fact (e.g., companies that get dropped from an
index are not included in the sample). Sample selection bias occurs when the
occurrence of an event (i.e., an observation) is contingent on specific outcomes. For
example, mortgage refinancing is severely curtailed when housing prices are falling,
so the sample of actual refinancing transactions is more likely to
come from a rising home-price environment.

- Simultaneity bias: This happens when the values of X and Y are simultaneously 
determined. For example, trading volume and volatility are related; volume 
increases during volatile times. 


- Omitted variables: Important explanatory (i.e., X) variables are excluded
from the model. When this happens, the errors capture the influence of the omitted
variables. Omission of important variables causes the coefficients to be biased and
may indicate nonexistent (i.e., misleading) relationships.


- Attenuation bias: This occurs when X variables are measured with error and 
leads to underestimation of the regression coefficients. 


■ All (X, Y) observations are independent and identically distributed (i.i.d.).


■ Variance of X is positive (otherwise estimation of $\beta$ would not be possible).


■ Variance of the errors is constant (i.e., homoskedasticity).


■ It is unlikely that large outliers will be observed in the data. OLS estimates are
sensitive to outliers, and large outliers have the potential to create misleading
regression results.


Collectively, these assumptions ensure that the regression estimators are unbiased (i.e.,
$E(\hat{\alpha}) = \alpha$ and $E(\hat{\beta}) = \beta$). Secondly, they ensure that the estimators are normally distributed
and, as a result, allow for hypothesis testing (discussed later).


Properties of OLS Estimators 


LO 18.d: Characterize the properties of OLS estimators and their sampling 
distributions. 


Because OLS estimators are derived from random samples, these estimators are also 
random variables because they vary from one sample to the next. Therefore, OLS 
estimators will have their own probability distributions (i.e., sampling distributions).
These sampling distributions allow us to estimate population parameters, such as the 
population mean, the population regression intercept term, and the population 
regression slope coefficient. 


Drawing multiple samples from a population will produce multiple sample means. The
distribution of these sample means is referred to as the sampling distribution of the
sample mean. The sample mean is used as an estimator of the population mean, and
because the mean of its sampling distribution equals the population mean, it is said to
be an unbiased estimator. An unbiased estimator is one for which the expected value of
the estimator is equal to the parameter you are trying to estimate.


Given the central limit theorem (CLT), for large sample sizes, it is reasonable to 
assume that the sampling distribution will approach the normal distribution. This 
means that the estimator is also a consistent estimator. A consistent estimator is one 
for which the accuracy of the parameter estimate increases as the sample size 
increases. 


Like the sample mean, the OLS estimators of the population intercept term and slope
coefficient also have sampling distributions. The OLS estimators, $\hat{\alpha}$ and $\hat{\beta}$,
are unbiased and consistent estimators of their
respective population parameters. Being able to assume that $\hat{\alpha}$ and $\hat{\beta}$ are normally
distributed is a key property in allowing us to make statistical inferences about
population coefficients.


The variance of the slope ($\hat{\beta}$) increases with variance of the error and decreases with
the variance of the explanatory variable. This makes sense because the variance of the 
slope indicates the reliability of the sample estimate of the coefficient, and the higher 
the variance of the error, the lower the reliability of the coefficient estimate. Higher 
variance of the explanatory (X) variable(s) indicates that there is sufficient diversity in 
observations (i.e., the sample is representative of the population) and, hence, lower 
variability (and higher confidence) of the slope estimate. 
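
Both effects can be seen in a small sketch. It relies on the standard textbook expression for the variance of the OLS slope under homoskedasticity, conditional on the X values: the error variance divided by the sum of squared deviations of X. That formula is not stated in this reading, so treat it, the function name, and the numbers as assumptions of the illustration.

```python
def slope_variance(error_variance, x):
    """Var(beta_hat | X) = error_variance / sum((x_i - x_bar)^2)."""
    x_bar = sum(x) / len(x)
    s_xx = sum((v - x_bar) ** 2 for v in x)
    return error_variance / s_xx


narrow_x = [1.0, 2.0, 3.0, 4.0, 5.0]    # little variation in X
wide_x = [2.0, 4.0, 6.0, 8.0, 10.0]     # same pattern, twice the spread

base = slope_variance(1.0, narrow_x)
noisier = slope_variance(2.0, narrow_x)   # error variance doubled
more_spread = slope_variance(1.0, wide_x)  # spread of X doubled
print(base, noisier, more_spread)
```

In this sketch, doubling the error variance doubles the slope variance, while doubling the spread of X cuts it to one quarter.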


MODULE QUIZ 18.2
1. Ordinary least squares (OLS) refers to the process that: 
A. maximizes the number of independent variables. 
B. minimizes the number of independent variables. 
C. produces sample regression coefficients. 
D. minimizes the sum of the squared error terms. 


2. What is the most appropriate interpretation of a slope coefficient estimate equal to 
10.0? 
A. The predicted value of the dependent variable when the independent variable is 
zero is 10.0. 
B. The predicted value of the independent variable when the dependent variable is 
zero is 0.1. 
C. For every one unit change in the independent variable, the model predicts that 
the dependent variable will change by 10 units. 
D. For every one unit change in the independent variable, the model predicts that 
the dependent variable will change by 0.1 units. 


3. The reliability of the estimate of the slope coefficient in a regression model is most 
likely: 
A. positively affected by the variance of the residuals and negatively affected by 
the variance of the independent variables. 
B. negatively affected by the variance of the residuals and negatively affected by 
the variance of the independent variables. 
C. positively affected by the variance of the residuals and positively affected by 
the variance of the independent variables. 
D. negatively affected by the variance of the residuals and positively affected by 
the variance of the independent variables. 


4. The mean inflation (Y) over the past 108 months is 0.01. Mean unemployment during
that same time period (X) is 0.044. The variance-covariance matrix for these
variables is as follows:

$$\begin{pmatrix} \sigma_Y^2 & \sigma_{XY} \\ \sigma_{XY} & \sigma_X^2 \end{pmatrix} = \begin{pmatrix} 254 & 45.76 \\ 45.76 & 16.84 \end{pmatrix}$$

What is the estimated slope coefficient and intercept, respectively?
A. 2.72 and -0.11. 

B. 1.89 and 0.01. 

C. 3.44 and -0.52. 

D. 1.44 and 1.23. 


5. A researcher estimates that the value of the slope coefficient in a single explanatory
variable linear regression model is equal to zero. Which one of the following is the most
appropriate interpretation of this result?
A. The mean of the Y variable is zero.
B. The intercept of the regression is zero.
C. The relation between X and Y is not linear.
D. The coefficient of determination ($R^2$) of the model is zero.


MODULE 18.3: HYPOTHESIS TESTING 


LO 18.e: Construct, apply, and interpret hypothesis tests and confidence intervals 
for a single regression coefficient in a regression. 


LO 18.f: Explain the steps needed to perform a hypothesis test in a linear 
regression. 


LO 18.g: Describe the relationship among a t-statistic, its p-value, and a
confidence interval. 


The steps in the hypothesis testing procedure for regression coefficients are as follows: 
1. Specify the hypothesis to be tested. 
2. Calculate the test statistic. 


3. Reject or fail to reject the null hypothesis after comparing the test statistic to its 
critical value. 


Given that the OLS regression assumptions discussed previously are valid, the
estimated slope coefficient, $\hat{\beta}$, will be normally distributed with a standard deviation
known as the standard error of the regression coefficient ($S_b$). We can then conduct
hypothesis testing using the sample value of the coefficient and its standard error.
Suppose we want to test the hypothesis that the value of the slope coefficient is equal
to $\beta_0$.

$H_0$: $\beta = \beta_0$ versus $H_a$: $\beta \neq \beta_0$

For this example, we use the following t-statistic:

$$t = \frac{\hat{\beta} - \beta_0}{S_b}$$

If the absolute value of the test statistic exceeds the critical t-value (from the t-table,
n - 2 degrees of freedom), we would reject the null hypothesis.


EXAMPLE: Hypothesis testing of slope coefficient 


A regression model estimated using 46 observations has $\hat{\beta}$ = 0.76 and $S_b$ = 0.33.
Determine if the slope coefficient is statistically different from zero at 5% level of 
significance. The critical t-value for a sample size of 46 and 5% level of significance 
is 2.02. 


Answer: 


$$t = \frac{\hat{\beta} - \beta_0}{S_b} = \frac{0.76 - 0}{0.33} = 2.30$$


Critical t-value = 2.02 (given). 


Because 2.30 > 2.02, we reject the null hypothesis and conclude in favor of the alternative
hypothesis (that the slope coefficient is not equal to zero).
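
The three-step procedure can be sketched in Python using the numbers from this example. The function name `slope_t_stat` is an assumption of the sketch, and the critical value of 2.02 is taken as given in the example and not looked up.

```python
def slope_t_stat(beta_hat, hypothesized_beta, standard_error):
    """t = (estimated slope - hypothesized slope) / standard error."""
    return (beta_hat - hypothesized_beta) / standard_error


# Step 1: H0: beta = 0 versus Ha: beta != 0.
# Step 2: calculate the test statistic.
t_stat = slope_t_stat(0.76, 0.0, 0.33)

# Step 3: compare with the critical t-value (2.02, as given in the example).
critical_t = 2.02
reject_null = abs(t_stat) > critical_t
print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
```

The comparison uses the absolute value of the statistic because the alternative hypothesis is two-sided.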


Confidence Intervals 
The confidence interval of the slope coefficient = $\hat{\beta} \pm (t_c \times S_b)$,


where $t_c$ is the critical t-value for a given level of significance and degrees of freedom
(n - 2).

In the previous example, $\hat{\beta}$ = 0.76, $S_b$ = 0.33, and $t_c$ = 2.02. Thus, the confidence interval
of the slope coefficient = 0.76 ± (2.02 × 0.33), or 0.0934 < slope coefficient < 1.43.


Notice that zero does not fall in the confidence interval, which should always be the
case if we correctly rejected the null hypothesis of $\beta$ = 0 in our hypothesis test.
Similarly, we can also test the hypothesis where $H_0$: $\beta$ = 0.20 versus $H_a$: $\beta \neq$ 0.20 and fail
to reject the null hypothesis because 0.20 does fall within the confidence interval.


In other words, if the hypothesized value of the slope coefficient falls outside of the 
confidence interval, we can reject the null. If it falls inside the confidence interval, we 
fail to reject the null hypothesis. 
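
The link between the interval and the test can be sketched as follows, again with the example's numbers. The helper names `slope_confidence_interval` and `rejects` are hypothetical.

```python
def slope_confidence_interval(beta_hat, standard_error, critical_t):
    """Return (lower, upper) bounds: beta_hat +/- critical_t * standard_error."""
    margin = critical_t * standard_error
    return beta_hat - margin, beta_hat + margin


def rejects(hypothesized_beta, interval):
    """Reject H0 when the hypothesized value lies outside the interval."""
    lower, upper = interval
    return not (lower <= hypothesized_beta <= upper)


interval = slope_confidence_interval(0.76, 0.33, 2.02)
print(f"{interval[0]:.4f} < slope coefficient < {interval[1]:.2f}")
print("Reject H0: beta = 0?   ", rejects(0.0, interval))
print("Reject H0: beta = 0.20?", rejects(0.20, interval))
```

Zero lies outside the interval and 0.20 lies inside it, which matches the two conclusions reached in the text.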


The p-Value 


The p-value is the smallest level of significance for which the null hypothesis can be 
rejected. An alternative method of doing hypothesis testing of regression coefficients is 
to compare the p-value to the significance level: 


■ If the p-value is less than the significance level, the null hypothesis can be rejected.


■ If the p-value is greater than the significance level, the null hypothesis cannot be
rejected.


In general, regression outputs will provide the p-value for the standard hypothesis ($H_0$:
$\beta$ = 0 versus $H_a$: $\beta \neq$ 0).


Consider again the example where $\hat{\beta}$ = 0.76, $S_b$ = 0.33, and the level of significance is 5%.
The regression output provides a p-value = 0.026. Because the p-value < level of
significance, we reject the null hypothesis that $\beta$ = 0, which is the same result as the one
we got when performing the t-test.
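
For readers curious where such a p-value comes from, the sketch below approximates the two-sided p-value for the example's t-statistic by numerically integrating the Student's t density with n - 2 = 44 degrees of freedom. Regression software would report this directly; the Simpson's rule integration and the function names here are choices made for this illustration only.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of the Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Approximate P(|T| > |t_stat|) using Simpson's rule (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # The density is symmetric, so both tails hold 1 - 2 * area(0, |t|).
    return 1 - 2 * area_zero_to_t


t_stat = 0.76 / 0.33   # slope estimate divided by its standard error
p_value = two_sided_p_value(t_stat, df=46 - 2)
print(f"p-value = {p_value:.3f}")
```

The approximation lands close to the 0.026 reported in the text and below the 5% significance level.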

MODULE QUIZ 18.3

Use the following information to answer Questions 1-3.


Bob Shepperd is trying to forecast the 10-year T-bond yield. Shepperd tries a variety of
explanatory variables in several iterations of a single-variable model. Partial results are
provided below (note that these represent three separate one-variable regressions):


Explanatory Variable | Coefficient | Standard Error | p-Value
Inflation            | 1.08        | 0.67           | 0.11
Unemployment rate    | -0.48       | 0.12           | < 0.001
GDP growth rate      | 1.33        | 0.45           | 0.005


The critical t-value at 5% level of significance is equal to 2.02. 
1. For the regression model involving inflation as the explanatory variable, the 
confidence interval for the slope coefficient is closest to: 
A. -0.27 to 2.43. 
B. 0.26 to 2.43. 
C. -2.27 to 2.43. 
D. 0.22 to 1.88. 


2. For the regression model involving unemployment rate as the explanatory variable, 
what are the results of a hypothesis test that the slope coefficient is equal to 0.20 
(vs. not equal to 0.20) at 5% level of significance? 


A. The coefficient is not significantly different from 0.20 because the p-value is < 
0.001. 

B. The coefficient is significantly different from 0.20 because the t-value is 2.33, 
which is greater than the critical t-value of 2.02. 

C. The coefficient is significantly different from 0.20 because the t-value is -5.67. 

D. The coefficient is not significantly different from 0.20 because the t-value is 
-2.33. 


3. For the regression model involving GDP growth rate as the explanatory variable, at a 
5% level of significance, which of the following statements about the slope 
coefficient is least accurate? 


A. The coefficient is significantly different from 0 because the p-value is 0.005.

B. The coefficient is significantly different from 0 because the 95% confidence
interval does not include the value of 0.

C. The coefficient is significantly different from 0 because the t-value is 2.27.

D. The coefficient is not significantly different from 1 because the t-value is 0.73.


KEY CONCEPTS 


LO 18.a 

Regression analysis attempts to measure the relationship between a dependent variable 
and one or more independent variables. 

To use linear regression, the following three conditions need to be satisfied: 

1. The relationship between Y and X should be linear. 

2. The variance of the error term is independent of the observed data. 

3. All X variables should be observable. 


LO 18.b 


The intercept term, $\alpha$, is the value of the dependent variable when the independent
variables are all equal to zero. The slope coefficient, $\beta$, is the estimated change in the
dependent variable for a one-unit change in that independent variable.


$$\hat{\beta} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}$$


$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$$

LO 18.c


Assumptions made with linear regression include the following: 


■ The expected value of the error term, conditional on the independent variable, is
zero.
■ All (X, Y) observations are independent and identically distributed (i.i.d.).
■ It is unlikely that large outliers will be observed in the data.
■ The variance of X is strictly > 0.
■ The variance of the errors is constant (i.e., homoskedasticity).


LO 18.d 
The OLS estimators, $\hat{\alpha}$ and $\hat{\beta}$, are unbiased and consistent estimators of their respective
population parameters, and their sampling distribution is approximately normal.


LO 18.e 
To conduct tests of hypotheses of the form $H_0$: $\beta = \beta_0$ versus $H_a$: $\beta \neq \beta_0$,
we use the following test statistic:


$$t = \frac{\hat{\beta} - \beta_0}{S_b}$$


If the absolute value of t exceeds the critical t-value (from the t-table, n - 2 degrees of
freedom), we would reject the null hypothesis.


The confidence interval of the slope coefficient = $\hat{\beta} \pm (t_c \times S_b)$.


LO 18.f 

Steps in hypothesis testing for linear regression: 
1. Specify the hypothesis. 

2. Calculate the test statistic. 


3. Reject or fail to reject the null hypothesis after comparing the test statistic to its 
critical value. 


LO 18.g

The confidence interval for the slope coefficient and t-test for the hypothesized value of 
the slope coefficient are related; if the hypothesized value falls within the confidence 
interval for the slope coefficient, we fail to reject the null hypothesis. If the p-value is 
less than the significance level, the null hypothesis can be rejected, otherwise we fail to 
reject the null. 


LO 18.h 


The $R^2$ represents the proportion of variation in the dependent variable that is
explained by the independent variable(s).


$$R^2 = r_{X,Y}^2$$


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 18.1 


1.B The regression equation can be written as: $E(Y) = \alpha + \beta \times X$. If X = 0, then Y = $\alpha$
(i.e., the intercept coefficient). (LO 18.a)


2.A The error term represents effects from independent variables not included in the 
model. It could be explained by additional independent variables. (LO 18.a) 


3.B Linear regression refers to a regression that is linear in the
coefficients/parameters; it may or may not be linear in the variables, which can
enter a linear regression after appropriate transformation. (LO 18.a)


Module Quiz 18.2 


1.D OLS is a process that minimizes the sum of squared residuals to produce
estimates of the population parameters known as sample regression coefficients.
(LO 18.b)


2.C The slope coefficient is best interpreted as the predicted change in the dependent
variable for a one-unit change in the independent variable. If the slope coefficient
estimate is 10.0 and the independent variable changes by one unit, the dependent
variable will change by 10 units. The intercept term is best interpreted as the
value of the dependent variable when the independent variable is equal to zero.
(LO 18.b)


3.D The reliability of the slope coefficient is inversely related to its variance, and the
variance of the slope coefficient ($\beta$) increases with the variance of the error term and
decreases with the variance of the explanatory variable. (LO 18.d)

4.A $$\beta = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = \frac{\sigma_{XY}}{\sigma_X^2} = \frac{45.76}{16.84} = 2.72$$

$$\alpha = \bar{Y} - \beta\bar{X} = 0.01 - 2.72 \times 0.044 = -0.11$$
(LO 18.b)


5.D When the slope coefficient is 0, variation in Y is unrelated to variation in X and
correlation $r_{XY} = 0$. Therefore, $R^2 = r_{XY}^2 = 0$.

Alternatively, recall that $\beta = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}$. If $\beta = 0$, $\text{Cov}(X,Y) = 0$ and
therefore:

$$r_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{0}{\sigma_X \sigma_Y} = 0$$
(LO 18.h)


Module Quiz 18.3 


1.A The confidence interval of the slope coefficient = $b \pm (t_c \times s_b) = 1.08 \pm (2.02 \times 0.67)$,
or -0.27 to 2.43. Notice that 0 falls within this interval and, hence, the coefficient
is not significantly different from 0 at the 5% level of significance. The p-value of 0.11
(> 0.05) also gives the same conclusion. (LO 18.e)


2.C The p-value provided is for hypothesized value of the slope coefficient being equal 
to 0. The hypothesized coefficient value is 0.20. 


- Bo 0.48 — 0.20 0.68 R 
t = — =- — 5.67 
S, 0.12 0.12 
(LO 18.g) 


3.C When the p-value is less than the level of significance, the slope coefficient is
significantly different from 0. For the test of hypothesis about coefficient value
significantly different from 0:

$$t = \frac{b - B_0}{s_b} = \frac{1.33 - 0}{0.45} = 2.96$$

The confidence interval of the slope coefficient = $b \pm (t_c \times s_b) = 1.33 \pm (2.02 \times 0.45)$, or 0.42 to
2.24. 0 is not in this confidence interval.

For the hypothesis test that the coefficient is equal to 1:

$$t = \frac{b - B_0}{s_b} = \frac{1.33 - 1}{0.45} = 0.73$$

(LO 18.g)


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 8. 


READING 19 


REGRESSION WITH MULTIPLE 
EXPLANATORY VARIABLES 


Study Session 6 


EXAM FOCUS 


In this reading, we generalize the regression model to include multiple explanatory 
variables. For the exam, be able to evaluate and calculate goodness-of-fit measures such
as $R^2$ and adjusted $R^2$, as well as hypothesis testing related to these concepts.
Hypothesis testing of individual slope coefficients in a multiple regression model as 
well as confidence intervals of those coefficients is also important testable material. 


MODULE 19.1: MULTIPLE REGRESSION 


LO 19.a: Distinguish between the relative assumptions of single and multiple 
regression. 


We extend our regression function in this reading to include multiple explanatory
variables (which is most commonly used in practice). The general form of this multiple
regression model is:

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki} + \varepsilon_i$$

where:
$\beta_j$ = regression or slope coefficients; sensitivity of Y to changes in $X_j$, controlling
for all other Xs
$\alpha$ = value of Y when all Xs = 0
$\varepsilon$ = random error or shock; unexplained (by X) component of Y (This error
may be reduced by using more independent variables or by using different,
more appropriate independent variables.)


Recall the assumptions of single regression model (modified for multiple Xs): 


1. The expected value of the error term, conditional on the independent variables, is
zero: [$E(\varepsilon_i | X_i) = 0$].


2. All (Xs and Y) observations are i.i.d. 
3. The variance of X is positive (otherwise estimation of $\beta$ would not be possible).


4. The variance of the errors is constant (i.e., homoskedasticity). 


5. There are no outliers observed in the data. 


An additional sixth assumption is needed for multiple regression: 


6. X variables are not perfectly correlated (i.e., they are not perfectly linearly 
dependent). In other words, each X variable in the model should have some variation 
that is not fully explained by the other X variables. 


LO 19.b: Interpret regression coefficients in a multiple regression. 


For a multiple regression, the interpretation of the slope coefficient is that it captures 
the change in the dependent variable for a one-unit change in the independent variable, 
holding the other independent variables constant. As a result, the slope coefficients in a 
multiple regression are sometimes called partial slope coefficients. 


The ordinary least squares (OLS) estimation process for multiple regression differs
from single regression. In a stepwise fashion, first, the individual explanatory variables
are regressed against the other explanatory variables, and the residuals from these models
become the explanatory variables in a regression that explains the dependent variable.
Consider a simple, two-independent-variable model:

$$Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$$

In Step 1, we estimate the residuals ($\nu_i$) in the following single regression using the OLS
estimation techniques discussed previously (the estimated coefficients a and b are not
actually used):

$$X_{1i} = a + b \times X_{2i} + \nu_i$$

In Step 2, we do the same, but this time estimate the residuals ($\eta_i$) in the model:

$$Y_i = c + d \times X_{2i} + \eta_i$$

Finally, the residuals from Step 2 are regressed against the residuals from Step 1 to
estimate the slope coefficient $\beta_1$:

$$\eta_i = \beta_1 \times \nu_i + \varepsilon_i$$

This stepwise estimation process ensures that the slope coefficient ($\beta_1$) is calculated
after controlling for the variation in the other independent variable ($X_2$). By reversing
the process, we can similarly estimate $\beta_2$ after controlling for $X_1$.
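
The three steps can be traced numerically. The following is a minimal sketch using made-up data; the helper functions and variable names are illustrative assumptions rather than anything from the reading. It estimates the slope on X1 with the stepwise procedure and compares it with the slope from solving the two-variable least squares problem directly.

```python
# Illustrative sketch: hypothetical data and helper names, not from the reading.

def mean(values):
    return sum(values) / len(values)

def simple_ols(x, y):
    """Return (intercept, slope) of a single regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def ols_residuals(x, y):
    """Residuals from a single regression of y on x."""
    intercept, slope = simple_ols(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical sample in which X1 and X2 are correlated, but not perfectly.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0]
y = [1.0 + 2.5 * a + 6.0 * b + e for a, b, e in zip(x1, x2, noise)]

# Step 1: residuals of X1 after regressing it on X2.
x1_resid = ols_residuals(x2, x1)
# Step 2: residuals of Y after regressing it on X2.
y_resid = ols_residuals(x2, y)
# Step 3: regress the Step 2 residuals on the Step 1 residuals.
_, beta1_stepwise = simple_ols(x1_resid, y_resid)

# Cross-check: slope on X1 from the two-variable normal equations.
def cross(u, v):
    """Sum of cross-products of deviations from the mean."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)
beta1_direct = (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)

print(f"beta_1 stepwise: {beta1_stepwise:.6f}")
print(f"beta_1 direct:   {beta1_direct:.6f}")
```

The two estimates should agree up to floating-point rounding, which is the sense in which the stepwise slope controls for the other independent variable.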


Interpreting Multiple Regression Results 


Now let’s discuss the interpretation of the multiple regression slope coefficients in
more detail. Suppose we run a regression of the dependent variable Y on a single
independent variable $X_1$ and get the following result:

$$Y = 2.0 + 4.5X_1$$

The appropriate interpretation of the estimated slope coefficient is that if $X_1$ increases
by 1 unit, we would expect Y to increase by 4.5 units.

Now suppose we add a second independent variable $X_2$ to the regression and get the
following result:

$$Y = 1.0 + 2.5X_1 + 6.0X_2$$

Notice that the estimated slope coefficient for $X_1$ changed from 4.5 to 2.5 when we
added $X_2$ to the regression. We would expect this to happen most of the time when a
second variable is added to the regression, unless $X_2$ is uncorrelated with $X_1$, because if
$X_1$ increases by 1 unit, then we would expect $X_2$ to change as well. The multiple
regression equation captures this relationship between $X_1$ and $X_2$ when predicting Y.

Now the interpretation of the estimated slope coefficient for $X_1$ is that if $X_1$ increases
by 1 unit, we would expect Y to increase by 2.5 units, holding $X_2$ constant.


As usual, the intercept (1.0) is interpreted as the expected value of Y when all Xs are 
equal to zero. 


EXAMPLE: Regression model with multiple explanatory variables 


A researcher estimated the following three-factor model to explain the return on 
different portfolios: 


$$R_{p,i} = 1.70 + 1.03R_{m,i} - 0.23R_{z,i} + 0.32R_{v,i}$$

(Note: Returns are expressed in percentage form.)

Calculate the following:
1. The return on a portfolio when $R_m$ = 8%, $R_z$ = 2%, and $R_v$ = 3%
2. The impact on portfolio return if $R_z$ declines by 1%
3. The expected return on the portfolio when $R_m = R_z = R_v = 0$

Answer:

1. $E(R_p)$ = 1.70 + (1.03 × 8) - (0.23 × 2) + (0.32 × 3) = 10.44%
2. Change in portfolio return = $\Delta R_p$ = -(0.23 × -1) = +0.23%
3. $E(R_p)$ when all factors = 0 would be the intercept = 1.70%
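
The same three calculations can be reproduced in a few lines. This is a small sketch only; the function name and the factor labels `r_m`, `r_z`, and `r_v` are illustrative assumptions for the three explanatory variables, with all returns in percentage points.

```python
# Coefficients taken from the example's estimated three-factor model.
INTERCEPT = 1.70
COEFFICIENTS = {"r_m": 1.03, "r_z": -0.23, "r_v": 0.32}

def expected_return(r_m, r_z, r_v):
    """Predicted portfolio return (%) for given factor returns (%)."""
    return (INTERCEPT
            + COEFFICIENTS["r_m"] * r_m
            + COEFFICIENTS["r_z"] * r_z
            + COEFFICIENTS["r_v"] * r_v)

# 1. Predicted return for the stated factor values.
base = expected_return(r_m=8.0, r_z=2.0, r_v=3.0)

# 2. Effect of a 1-point decline in the second factor, holding the others constant.
impact = expected_return(r_m=8.0, r_z=1.0, r_v=3.0) - base

# 3. Predicted return when every factor is zero (the intercept).
at_zero = expected_return(0.0, 0.0, 0.0)

print(f"{base:.2f}%  {impact:+.2f}%  {at_zero:.2f}%")
```

Because the model is linear, the impact in item 2 depends only on the coefficient of the factor that moves, not on the levels of the other factors.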




MODULE QUIZ 19.1 
Use the following information to answer Questions 1 and 2. 
Multiple regression was used to explain stock returns using the following variables: 


Dependent variable:
RET = annual stock returns (%)

Independent variables:
MKT = market capitalization = market capitalization / $1.0 million
IND = industry quartile ranking (IND = 4 is the highest ranking)
FORT = Fortune 500 firm, where {FORT = 1 if the stock is that of a
Fortune 500 firm, FORT = 0 if not a Fortune 500 stock}

The regression results are presented in the following table.

                        Coefficient   Standard Error   t-Statistic   p-Value
Intercept               0.5220        1.2100           0.430         0.681
Market capitalization   0.0460        0.0150           3.090         0.021
Industry ranking        0.7102        0.2725           2.610         0.040
Fortune 500             0.9000        0.5281           1.700         0.139


1. Based on the results in the table, which of the following most accurately represents 
the regression equation? 
A. 0.43 + 3.09(MKT) + 2.61(IND) + 1.70(FORT). 
B. 0.681 + 0.021(MKT) + 0.04(IND) + 0.139(FORT). 
C. 0.522 + 0.0460(MKT) + 0.7102(IND) + 0.9(FORT). 
D. 1.21 + 0.015(MKT) + 0.2725(IND) + 0.5281(FORT). 


2. The expected amount of the stock return attributable to it being a Fortune 500 
stock is closest to: 
A. 0.522. 
B. 0.046. 
C. 0.710. 
D. 0.900. 


3. Which of the following is not an assumption of single regression?
A. There are no outliers in the data.
B. The variance of the independent variables is greater than zero.
C. Independent variables are not perfectly correlated.
D. Residual variances are homoskedastic.


MODULE 19.2: MEASURES OF FIT IN LINEAR 
REGRESSION 


LO 19.c: Interpret goodness-of-fit measures for single and multiple regressions,
including $R^2$ and adjusted $R^2$.


LO 19.e: Calculate the regression $R^2$ using the three components of the
decomposed variation of the dependent variable data: the explained sum of
squares, the total sum of squares, and the residual sum of squares.


The standard error of the regression (SER) measures the uncertainty about the
accuracy of the predicted values of the dependent variable. Graphically, the relationship
is stronger when the actual (x, y) data points lie closer to the regression line (i.e., the
errors are smaller).


Recall that OLS estimation minimizes the sum of the squared differences between the
predicted value and actual value for each observation. Also, recall that the regression
model seeks to explain the variation in Y:

$$\sum (Y_i - \bar{Y})^2$$

We can write the deviation from mean for Y as:

$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$

Therefore,

$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$

or

TSS = ESS + RSS
where: 


TSS = total sum of squares (i.e., total variation in Y) 

ESS = explained sum of squares (i.e., the variation in Y explained by the
regression model)

RSS = residual sum of squares (i.e., the unexplained variation in Y)


Figure 19.1 illustrates how the total variation in the dependent variable (TSS) is 
composed of RSS and ESS. 


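Figure 19.1: Components of the Total Variation

[Figure: Y plotted with the regression line $\hat{Y} = b_0 + b_1X$; the vertical distance $(Y_i - \hat{Y}_i)$ between an observation and the line is labeled as the RSS component.]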


Coefficient of Determination 
Dividing both sides by TSS, we see that 1 = (ESS/TSS) + (RSS/TSS)

The first term on the right side captures the proportion of variation in Y that is
explained. This proportion is the coefficient of determination ($R^2$) of a multiple
regression and is a goodness-of-fit measure.

$R^2$ = ESS/TSS = % of variation explained by the regression model

Recall that for a single regression, $R^2 = r_{XY}^2$. For a multiple regression, $R^2 = r_{Y,\hat{Y}}^2$.
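
To make the decomposition concrete, the sketch below fits a single regression on a small made-up sample (the data and variable names are assumptions for illustration only), computes TSS, ESS, and RSS, and then checks that ESS/TSS matches the squared correlation between Y and the fitted values.

```python
import math

# Hypothetical sample for a single regression of y on x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def mean(values):
    return sum(values) / len(values)

x_bar, y_bar = mean(x), mean(y)

# OLS slope and intercept for the single regression.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Decomposition of the variation in y.
tss = sum((yi - y_bar) ** 2 for yi in y)                 # total
ess = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # residual

r_squared = ess / tss

# Correlation between y and its fitted values.
y_hat_bar = mean(y_hat)
cov_y_yhat = sum((yi - y_bar) * (yh - y_hat_bar) for yi, yh in zip(y, y_hat))
corr_y_yhat = cov_y_yhat / math.sqrt(
    tss * sum((yh - y_hat_bar) ** 2 for yh in y_hat))

print(f"TSS = {tss:.4f}, ESS + RSS = {ess + rss:.4f}")
print(f"R^2 = {r_squared:.4f}, corr(y, y_hat)^2 = {corr_y_yhat ** 2:.4f}")
```

The identity TSS = ESS + RSS holds here because the regression includes an intercept and is estimated by OLS.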


For a multiple regression, the coefficient of determination $R^2$ is the square of the
correlation between Y and predicted value of Y. While it is a goodness-of-fit measure,
$R^2$ by itself may not be a reliable measure of the explanatory power of the multiple
regression model for three reasons. First, $R^2$ almost always increases as independent
variables are added to the model, even if the marginal contribution of the new variables
is not statistically significant. Consequently, a relatively high $R^2$ may reflect the impact
of a large set of independent variables rather than how well the set explains the
dependent variable. This problem is often referred to as overestimating the regression.


Adjusted $R^2$


To overcome the problem of overestimating the impact of additional variables on the
explanatory power of a regression model, many researchers recommend adjusting $R^2$
for the number of independent variables. The adjusted $R^2$ value is expressed as:

$$\text{adjusted } R^2 = 1 - \left[\left(\frac{n-1}{n-k-1}\right) \times (1 - R^2)\right]$$

where:
n = number of observations
k = number of independent variables

Note that adjusted $R^2$ will be less than or equal to $R^2$. So, while adding a new independent
variable to the model will increase $R^2$, it may either increase or decrease the adjusted $R^2$. If the
new variable has only a small effect on $R^2$, the value of adjusted $R^2$ may decrease. In addition,
adjusted $R^2$ may be less than zero if the $R^2$ is low enough.


Second, $R^2$ is not comparable across models with different dependent (i.e., Y) variables.
Finally, there are no clear predefined values of $R^2$ that indicate whether the model is
good or not. For some noisy variables (e.g., currency values), even models with a low $R^2$
may provide valuable insight.


EXAMPLE: Calculating $R^2$ and adjusted $R^2$


An analyst runs a regression of monthly value-stock returns on 5 independent
variables over 60 months. The total sum of squares for the regression is 460, and the
residual sum of squares is 170. Calculate the $R^2$ and adjusted $R^2$.


Answer:

$$R^2 = \frac{460 - 170}{460} = 0.630 = 63.0\%$$

$$\text{adjusted } R^2 = 1 - \left[\left(\frac{60-1}{60-5-1}\right) \times (1 - 0.63)\right] = 0.596 = 59.6\%$$

The $R^2$ of 63% suggests that the five independent variables together explain 63% of
the variation in monthly value-stock returns.


EXAMPLE: Interpreting adjusted $R^2$


Suppose the analyst now adds four more independent variables to the previous
regression, and the $R^2$ increases to 65.0%. Identify which model the analyst would
most likely prefer.


Answer:


With nine independent variables, even though the $R^2$ has increased from 63% to
65%, the adjusted $R^2$ has decreased from 59.6% to 58.7%:

$$\text{adjusted } R^2 = 1 - \left[\left(\frac{60-1}{60-9-1}\right) \times (1 - 0.65)\right] = 0.587 = 58.7\%$$

The analyst would prefer the first model because the adjusted $R^2$ is higher and the
model has five independent variables as opposed to nine.
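
Both adjusted R-squared examples can be checked with a short helper. This is a minimal sketch; the function name `adjusted_r_squared` is an illustrative assumption, and the inputs are the figures given in the two examples.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - ((n - 1) / (n - k - 1)) * (1.0 - r_squared)

# First example: five independent variables, 60 months, R^2 rounded to 0.63.
adj_five = adjusted_r_squared(0.63, n=60, k=5)

# Second example: nine independent variables, R^2 = 0.65.
adj_nine = adjusted_r_squared(0.65, n=60, k=9)

print(f"5 variables: adjusted R^2 = {adj_five:.3f}")
print(f"9 variables: adjusted R^2 = {adj_nine:.3f}")
print("prefer 5-variable model:", adj_five > adj_nine)
```

The comparison shows the penalty at work: the extra four variables raise the unadjusted measure by two percentage points, but not by enough to offset the loss of degrees of freedom.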


Joint Hypothesis Tests and Confidence Intervals 


LO 19.d: Construct, apply, and interpret joint hypothesis tests and confidence 
intervals for multiple coefficients in a regression. 


As with single regression, the magnitude of the coefficients in a multiple regression 
tells us nothing about the importance of the independent variable in explaining the 
dependent variable. Thus, we must conduct hypothesis testing on the estimated slope 
coefficients to determine if the independent variables make a significant contribution 
to explaining the variation in the dependent variable. 


The t-statistic used to test the significance of the individual coefficients in a multiple 
regression is calculated using the same formula that is used with single regression: 


$$t = \frac{b_j - B_j}{s_{b_j}} = \frac{\text{estimated regression coefficient} - \text{hypothesized value}}{\text{coefficient standard error of } b_j}$$


For a multiple regression, the t-statistic has (n - k - 1) degrees of freedom. 


Determining Statistical Significance 


The most common hypothesis test done on the regression coefficients is to test 
statistical significance, which means testing the null hypothesis that the coefficient is 
zero versus the alternative that it is not: 


testing statistical significance ⇒ $H_0: b_j = 0$ versus $H_A: b_j \neq 0$


EXAMPLE: Testing the statistical significance of a regression coefficient 


Consider the hypothesis that future 10-year real earnings growth in the S&P 500 
(EG10) can be explained by the trailing dividend payout ratio of the stocks in the 
index (PR) and the yield curve slope (YCS). Test the statistical significance of the 
independent variable PR in the real earnings growth example at the 10% significance 
level. Assume that the number of observations is 46 and the critical t-value for 10% 
level of significance is 1.68. The results of the regression are produced in the 
following table. 


Coefficient and Standard Error Estimates for Regression of EG10 on PR and YCS

            Coefficient   Standard Error
Intercept   -11.6%        1.657%
PR          0.25          0.032
YCS         0.14          0.280

Answer:


We are testing the following hypothesis:

$H_0: PR = 0$ versus $H_A: PR \neq 0$

The t-statistic is:

$$t = \frac{0.25}{0.032} = 7.8$$

Therefore, because the t-statistic of 7.8 is greater than the upper critical t-value of
1.68, we can reject the null hypothesis and conclude that the PR regression
coefficient is statistically significantly different from zero at the 10% significance
level.


Similar to single regression, the confidence interval for a regression coefficient in 
multiple regression is calculated as: 


$$b_j \pm (t_c \times s_{b_j})$$
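
As a quick check on the PR example, the sketch below computes the t-statistic and the matching confidence interval from the coefficient, its standard error, and the critical t-value of 1.68 given in the example. The function names are illustrative assumptions; the critical value is taken as given rather than looked up.

```python
def t_statistic(estimate, hypothesized, std_error):
    """t = (estimated coefficient - hypothesized value) / standard error."""
    return (estimate - hypothesized) / std_error

def confidence_interval(estimate, std_error, t_critical):
    """Two-sided interval: estimate +/- t_critical * standard error."""
    half_width = t_critical * std_error
    return estimate - half_width, estimate + half_width

# Inputs from the EG10 example (PR coefficient, 10% significance level).
pr_coefficient = 0.25
pr_std_error = 0.032
t_critical = 1.68

t_stat = t_statistic(pr_coefficient, 0.0, pr_std_error)
reject_null = abs(t_stat) > t_critical
lower, upper = confidence_interval(pr_coefficient, pr_std_error, t_critical)

print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
print(f"90% confidence interval: {lower:.3f} to {upper:.3f}")
```

The two approaches agree, as they must: the null is rejected exactly when the hypothesized value (zero here) lies outside the interval.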


For models with multiple variables, the univariate t-test is not applicable when testing 
complex hypotheses involving the impact of more than one variable. Instead, we use the 
F-test. 


The F-Test 


An F-test is useful to evaluate a model against other competing partial models. For
example, a model with three independent variables ($X_1$, $X_2$, and $X_3$) can be compared
against a model with only one independent variable ($X_1$). We are trying to see if the two
additional variables ($X_2$ and $X_3$) in the full model contribute meaningfully to explain the
variation in Y.

$H_0: \beta_2 = \beta_3 = 0$ versus $H_A$: either $\beta_2 \neq 0$ or $\beta_3 \neq 0$

The F-statistic for multiple regression coefficients, which is always a one-tailed test, is
calculated as:

$$F = \frac{(RSS_P - RSS_F)/q}{RSS_F/(n - k_F - 1)} = \frac{(R_F^2 - R_P^2)/q}{(1 - R_F^2)/(n - k_F - 1)}$$

where:
$RSS_F$ = residual sum of squares of the full model
$RSS_P$ = residual sum of squares of the partial model
$R_F^2$ = coefficient of determination of the full model
$R_P^2$ = coefficient of determination of the partial model
q = number of restrictions imposed on the full model to arrive at the partial
model
n = number of observations
$k_F$ = number of independent variables in the full model

The calculated F-statistic is compared to the critical F-value [with q degrees of
freedom in the numerator and (n - $k_F$ - 1) degrees of freedom in the denominator]. If
the calculated F-stat is greater than the critical F-value, the full model contributes
meaningfully to explaining the variation in Y.


EXAMPLE: F-test 


A researcher is seeking to explain returns on a stock using the market returns as an 
explanatory variable (CAPM formulation). The researcher wants to determine 
whether two additional explanatory variables contribute meaningfully to variation 
in the stock’s return. Using a sample consisting of 64 observations, the researcher 
found that RSS in the model with three explanatory variables is 6,650 while the RSS 
in the single-variable model is 7,140. Evaluate the model with extra variables 
relative to the standard CAPM formulation. 


Answer: 


Given $RSS_F$ = 6,650; $RSS_P$ = 7,140; n = 64; $k_F$ = 3; and q = number of variables
removed = 2, the F-test statistic is computed as:

$$F = \frac{(7{,}140 - 6{,}650)/2}{6{,}650/(64 - 3 - 1)} = 2.21$$

Critical F-value at 5% level of significance (df numerator = 2, df denominator = 60) =
3.15

Because F = 2.21 < 3.15, we can state that the full model does not contribute
meaningfully to explaining variation in Y. In other words, we fail to reject the null
hypothesis that $\beta_2 = \beta_3 = 0$. Note that one of the two variables removed from the full
model may still be significant on its own; we are only concluding here that the two
variables are jointly insignificant.
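
The calculation in this example can be reproduced with a small function. This is a sketch under the example's stated inputs; the function name is an illustrative assumption, and the critical value of 3.15 is taken from the example rather than computed.

```python
def partial_f_statistic(rss_partial, rss_full, q, n, k_full):
    """F-statistic comparing a full model with a restricted (partial) model."""
    numerator = (rss_partial - rss_full) / q
    denominator = rss_full / (n - k_full - 1)
    return numerator / denominator

# Inputs from the example: three-variable model versus single-variable model.
f_stat = partial_f_statistic(rss_partial=7140.0, rss_full=6650.0,
                             q=2, n=64, k_full=3)

F_CRITICAL = 3.15  # 5% level, 2 and 60 degrees of freedom, as given in the example
reject_null = f_stat > F_CRITICAL

print(f"F = {f_stat:.2f}, reject H0: {reject_null}")
```

Because the test is one-tailed, only an F-statistic above the critical value counts as evidence against the null that the extra coefficients are jointly zero.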


A more generic F-test is used to test the hypothesis that all variables included in the
model do not contribute meaningfully in explaining the variation in Y versus at least
one of the variables does contribute statistically significantly.

$H_0: \beta_1 = \beta_2 = \beta_3 = \ldots = \beta_k = 0$ versus $H_A$: at least one $\beta_j \neq 0$

In such a case, we calculate the F-statistic as follows:

$$F = \frac{ESS/k}{RSS/(n - k - 1)}$$

The calculated F-statistic is then compared to critical F-value (with numerator degrees
of freedom = k and denominator degrees of freedom = n - k - 1). If F-stat > critical F,
we reject the null hypothesis.


EXAMPLE: Calculating and interpreting the F-statistic 


An analyst runs a regression of monthly value-stock returns on five independent
variables over 46 months. The total sum of squares is 460, and the residual sum of
squares is 170. Test the null hypothesis at the 5% significance level (95%
confidence) that the coefficients on all five of the independent variables are equal to zero.


Answer:


The null and alternative hypotheses are:

$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ versus $H_A$: at least one $\beta_j \neq 0$

ESS = TSS - RSS = 460 - 170 = 290

$$F = \frac{290/5}{170/(46 - 5 - 1)} = \frac{58}{4.25} = 13.65$$

The critical F-value for 5 and 40 degrees of freedom at a 5% significance level is 2.45.
Therefore, we can reject the null hypothesis and conclude that at least one of the five
independent variables is significantly different from zero.
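
For reference, the joint test in this example can be sketched as follows. The function name is an illustrative assumption, and the critical value of 2.45 is the one quoted in the example rather than something the code derives.

```python
def overall_f_statistic(tss, rss, n, k):
    """F-statistic for the null that all k slope coefficients are zero."""
    ess = tss - rss                      # explained sum of squares
    return (ess / k) / (rss / (n - k - 1))

# Inputs from the example: five independent variables, 46 monthly observations.
f_stat = overall_f_statistic(tss=460.0, rss=170.0, n=46, k=5)

F_CRITICAL = 2.45  # 5% level, 5 and 40 degrees of freedom, as given in the example
reject_null = f_stat > F_CRITICAL

print(f"F = {f_stat:.2f}, reject H0: {reject_null}")
```

This is the special case of the full-versus-partial comparison in which the partial model contains only an intercept, so the number of restrictions equals k.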




MODULE QUIZ 19.2 
Use the following information to answer Questions 1 and 2. 


Phil Ohlmer estimates a cross sectional regression in order to predict price to earnings 
ratios (P/E) with fundamental variables that are related to P/E, including dividend 
payout ratio (DPO), growth rate (G), and beta (B). In addition, all 50 stocks in the 
sample come from two industries, electric utilities or biotechnology. He defines the 
following dummy variable: 
IND = 0 if the stock is in the electric utilities industry
or
IND = 1 if the stock is in the biotechnology industry

The results of his regression are shown in the following table.

Variable    Coefficient   t-Statistic
Intercept   6.75          3.89*
IND         8.00          4.50*
DPO         4.00          1.86
G           12.35         2.43*
B           -0.50         1.46

*Significant at the 5% level


1. Based on these results, it would be most appropriate to conclude that: 

A. biotechnology industry P/Es are statistically significantly larger than electric 
utilities industry P/Es. 

B. electric utilities P/Es are statistically significantly larger than biotechnology 
industry P/Es, holding DPO, G, and B constant. 

C. biotechnology industry P/Es are statistically significantly larger than electric 
utilities industry P/Es, holding DPO, G, and B constant. 

D. the dummy variable does not display statistical significance. 


2. Ohlmer is valuing a biotechnology stock with a dividend payout ratio of 0.00, a beta 
of 1.50, and an expected earnings growth rate of 0.14. The predicted P/E on the 
basis of the values of the explanatory variables for the company is closest to: 

A. 7.7. 
B. 15.7. 
C. 17.2. 
D. 11.3. 


3. When interpreting the $R^2$ and adjusted $R^2$ measures for a multiple regression, which
of the following statements incorrectly reflects a pitfall that could lead to invalid
conclusions?

A. The $R^2$ measure does not provide evidence that the most or least appropriate
independent variables have been selected.

B. If the $R^2$ is high, we have to assume that we have found all relevant independent
variables.

C. If adding an additional independent variable to the regression improves the $R^2$,
this variable is not necessarily statistically significant.

D. The $R^2$ measure may be spurious, meaning that the independent variables may
show a high $R^2$; however, they are not the exact cause of the movement in the
dependent variable.


KEY CONCEPTS 


LO 19.a 

In addition to the assumptions of single regression, multiple regression requires that 
the X variables are not perfectly correlated (i.e., they are not perfectly linearly 
dependent). So, each X variable should have some variation that is not fully explained 
by the other X variables. 


LO 19.b 

For a multiple regression, the interpretation of the slope coefficient captures the 
change in dependent variable for one unit change in independent variable, holding the 
other independent variables constant. 


LO 19.c and 19.e 


The coefficient of determination ($R^2$) of a multiple regression is a goodness-of-fit
measure.

$R^2$ = ESS/TSS = % of variation explained by the regression model

where:
TSS = total sum of squares (i.e., total variation in Y)
ESS = explained sum of squares (i.e., the variation in Y explained by the
regression model)

Because $R^2$ almost always increases as independent variables are added to the model,
to overcome the problem of overestimating the impact of additional variables we
calculate adjusted $R^2$ as:

$$\text{adjusted } R^2 = 1 - \left[\left(\frac{n-1}{n-k-1}\right) \times (1 - R^2)\right]$$

where:
n = number of observations
k = number of independent variables


LO 19.d 
The t-statistic used to test the significance of the individual coefficients in a multiple
regression is calculated using the same formula that is used with simple linear
regression:

$$t = \frac{b_j - B_j}{s_{b_j}} = \frac{\text{estimated regression coefficient} - \text{hypothesized value}}{\text{coefficient standard error of } b_j}$$

This t-statistic has n - k - 1 degrees of freedom.


Similar to single regression, the confidence interval for a regression coefficient in 
multiple regression is calculated as: 


$$b_j \pm (t_c \times s_{b_j})$$


An F-test is useful to evaluate a model against other competing partial models.

$$F = \frac{(RSS_P - RSS_F)/q}{RSS_F/(n - k_F - 1)} = \frac{(R_F^2 - R_P^2)/q}{(1 - R_F^2)/(n - k_F - 1)}$$

where the subscripts F and P denote the full and partial models, respectively.


A more generic F-test for the hypothesis $H_0: \beta_1 = \beta_2 = \beta_3 = \ldots = \beta_k = 0$ versus $H_A$: at
least one $\beta_j \neq 0$ can be conducted using the following equation:

$$F = \frac{ESS/k}{RSS/(n - k - 1)}$$

ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 19.1 


1.C The coefficients column contains the regression parameters. (LO 19.b) 


2.D The regression equation is 0.522 + 0.0460(MKT) + 0.7102(IND) + 0.9(FORT). The 
coefficient on FORT is the amount of the return attributable to the stock of a 
Fortune 500 firm. (LO 19.b) 


3.C This is an assumption for multiple regression and not for single regression. (LO 
19.a) 


Module Quiz 19.2 


1.C The t-statistic tests the null that industry P/Es are equal. The dummy variable is 
significant and positive, and the dummy variable is defined as being equal to one 
for biotechnology stocks, which means that biotechnology P/Es are statistically 
significantly larger than electric utility P/Es. Remember, however, this is only 
accurate if we hold the other independent variables in the model constant. (LO 
19.d) 


2.B Note that IND = 1 because the stock is in the biotech industry. Predicted P/E = 
6.75 + (8.00 x 1) + (4.00 x 0.00) + (12.35 x 0.14) - (0.50 x 1.5) = 15.7. (LO 19.b) 


3.B If the $R^2$ is high, we cannot assume that we have found all relevant independent
variables. Omitted variables may still exist, which would improve the regression 
results further. (LO 19.c) 


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 9. 


READING 20 
REGRESSION DIAGNOSTICS 


Study Session 6 


EXAM FOCUS 


This reading focuses on model specification issues and the determination of whether 
the assumptions underlying multiple regression are violated. For the exam, be able to 
explain the effects of heteroskedasticity and multicollinearity on a regression. Also, 
understand the bias-variance tradeoff and the consequences of including an irrelevant 
explanatory variable versus excluding a relevant explanatory variable. 


MODULE 20.1: HETEROSKEDASTICITY AND 
MULTICOLLINEARITY 


LO 20.a: Explain how to test whether a regression is affected by 
heteroskedasticity. 


LO 20.b: Describe approaches to using heteroskedastic data. 


If the variance of the residuals is constant across all observations in the sample, the 
regression is said to be homoskedastic. When the opposite is true, the regression 
exhibits heteroskedasticity, which occurs when the variance of the residuals is not the 
same across all observations in the sample. This happens when there are subsamples 
that are more spread out than the rest of the sample. 


Unconditional heteroskedasticity occurs when the heteroskedasticity is not related 
to the level of the independent variables, which means that it doesn’t systematically 
increase or decrease with changes in the value of the independent variable(s). While 
this is a violation of the equal variance assumption, it usually causes no major 
problems with the regression. 


Conditional heteroskedasticity is heteroskedasticity that is related to the level of
(i.e., conditional on) the independent variable. For example, conditional
heteroskedasticity exists if the variance of the residual term increases as the value of
the independent variable increases, as shown in Figure 20.1. Notice in this figure that
the residual variance associated with the larger values of the independent variable, X, is
larger than the residual variance associated with the smaller values of X. Conditional
heteroskedasticity does create significant problems for statistical inference.


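Figure 20.1: Conditional Heteroskedasticity

[Figure: Y plotted with the regression line $\hat{Y} = b_0 + b_1X$; the region of smaller X values is labeled "Low residual variance" and the region of larger X values is labeled "High residual variance."]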


Effect of Heteroskedasticity on Regression Analysis 
There are several effects of heteroskedasticity you need to be aware of:

■ The standard errors are usually unreliable estimates.
■ The coefficient estimates (i.e., the $b_j$) are still consistent and unbiased.
■ Because of unreliable standard errors, hypothesis testing is unreliable.


Detecting Heteroskedasticity 


As shown in Figure 20.1, a scatterplot of the residuals versus one of the independent
variables can reveal patterns among observations. Formally, a chi-squared test statistic
can be computed as follows:


1. Estimate the regression using standard ordinary least squares (OLS) procedures,
estimate the residuals, and square them ($e_i^2$).


2. Use the squared estimated residuals from Step 1 as the dependent variable in a new
regression with the original explanatory variables.


3. Calculate the $R^2$ for the model in Step 2 and use it to calculate the chi-squared test
statistic:

$$\chi^2 = nR^2$$

The chi-squared statistic is compared to its critical value with [k × (k + 3) / 2]
degrees of freedom, where k = number of independent variables.


4. If the calculated $\chi^2$ > critical $\chi^2$, we reject the null hypothesis of no conditional
heteroskedasticity.
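
The four steps can be sketched in plain Python. This is an illustrative sketch on simulated data, not a reference implementation. It assumes the auxiliary regression in Step 2 includes each explanatory variable, its square, and the cross-product, which is the form that yields k × (k + 3) / 2 regressors (five when k = 2). The helper functions, the simulated data, and the hard-coded 5% critical value of 11.07 for 5 degrees of freedom are all assumptions of the sketch.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """OLS of y on the given columns plus an intercept: (fitted values, R^2)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return fitted, 1.0 - rss / tss

# Simulated sample: the error standard deviation grows with x1 by construction.
random.seed(42)
n = 60
x1 = [1.0 + 0.15 * i for i in range(n)]
x2 = [float((7 * i) % 11) for i in range(n)]
y = [2.0 + 1.5 * a - 0.8 * b + random.gauss(0.0, 0.3 * a) for a, b in zip(x1, x2)]

# Step 1: fit the original regression and square the residuals.
fitted, _ = ols([x1, x2], y)
squared_resid = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]

# Step 2: auxiliary regression (levels, squares, and cross-product: an assumption).
aux_columns = [x1, x2,
               [a * a for a in x1], [b * b for b in x2],
               [a * b for a, b in zip(x1, x2)]]
_, r2_aux = ols(aux_columns, squared_resid)

# Step 3: test statistic and degrees of freedom.
k = 2
chi2_stat = n * r2_aux
df = k * (k + 3) // 2

# Step 4: compare with the 5% critical value for 5 degrees of freedom.
CHI2_CRITICAL_5PCT_DF5 = 11.07
reject_null = chi2_stat > CHI2_CRITICAL_5PCT_DF5
print(f"chi-squared = {chi2_stat:.2f} with {df} df, reject H0: {reject_null}")
```

In practice a statistics package would handle the linear algebra and the critical value lookup; the hand-rolled solver is used here only to keep the sketch free of external dependencies.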


Correcting for Heteroskedasticity 


If conditional heteroskedasticity is detected, we can conclude that the coefficients are 
unaffected but the standard errors are unreliable. In such a case, revised, White 
standard errors should be used in hypothesis testing instead of the standard errors from 
OLS estimation procedures. 


PROFESSOR'S NOTE
White standard errors are heteroskedasticity-consistent standard errors. The
introduction of these robust standard errors is credited to Halbert White, a
well-known professor in econometrics.


LO 20.c: Characterize multicollinearity and its consequences, as well as 
distinguish between multicollinearity and perfect collinearity. 


Recall from the previous reading the additional assumption needed in multiple
regression as opposed to a single regression: X variables are not perfectly correlated
(i.e., they are not perfectly linearly dependent). When the X variables are perfectly
correlated, it is called perfect collinearity. This would be the case when one of the
independent variables can be perfectly characterized by a linear combination of other
independent variables (e.g., $X_3 = 2X_1 + 3X_2$).


Multicollinearity refers to the condition when two or more of the independent 
variables, or linear combinations of the independent variables, in a multiple regression 
are highly correlated with each other. While multicollinearity does not represent a 
violation of regression assumptions, its existence compromises the reliability of 
parameter estimates. 


Effect of Multicollinearity on Regression Analysis 


Multicollinearity inflates the standard errors of the slope coefficients, which lowers
their t-statistics. As a result, there is a greater probability that we will incorrectly
conclude that a variable is not statistically significant (i.e., a Type II error).
Multicollinearity is likely to be present to some extent in most economic models. The 
issue is whether the multicollinearity has a significant effect on the regression results. 


Detecting Multicollinearity 


The most common sign of multicollinearity is a situation where t-tests
indicate that none of the individual coefficients is significantly different from zero,
while the $R^2$ is high (and the F-test rejects the null hypothesis). This suggests that the
variables together explain much of the variation in the dependent variable, but the 
individual independent variables do not. The only way this can happen is when the 
independent variables are highly correlated with each other, so while their common 
source of variation is explaining the dependent variable, the high degree of correlation 
also “washes out” the individual effects. 


EXAMPLE: Detecting multicollinearity 


Bob Watson runs a regression of mutual fund returns on average P/B, average P/E, 
and average market capitalization, with the following results: 


Variable Coefficient p-Value 
Average P/B ESA 0.15 
Average P/E 2.78 0.21 
Market Cap 4.03 0.11 
R² 89.6%


Determine whether or not multicollinearity is a problem in this regression. 


Answer: 


The R² is high, which suggests that the three variables as a group do an excellent job
of explaining the variation in mutual fund returns. However, none of the independent 
variables individually is statistically significant to any reasonable degree, because 
the p-values are larger than 10%. This is a classic indication of multicollinearity. 


Another approach to identify multicollinearity is to calculate the variance inflation
factor (VIF) for each explanatory variable. To do that, we calculate R² in the model
using the subject explanatory variable ($X_j$) as the dependent variable and the other X
variables as independent variables:


$$X_j = b_0 + b_1 X_1 + \ldots + b_{j-1} X_{j-1} + b_{j+1} X_{j+1} + \ldots + b_k X_k$$


This $R_j^2$ is then used in the VIF formula as follows:


$$VIF_j = \frac{1}{1 - R_j^2}$$


A VIF > 10 (i.e., R² > 90%) should be considered problematic for that variable.
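

The VIF calculation can be illustrated with a short Python sketch. It is not part of the assigned reading: the helper names (`solve`, `r_squared`, `vifs`) and the simulated regressors are assumptions made for illustration. Each X variable is regressed on the others, and the resulting R² is plugged into 1 / (1 - R²).

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(y, xs):
    """R^2 from an OLS regression of y on an intercept and the columns in xs."""
    rows = [[1.0] + [x[i] for x in xs] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vifs(columns):
    """VIF_j = 1 / (1 - R_j^2), regressing each column on all the others."""
    out = []
    for j, xj in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        out.append(1.0 / (1.0 - r_squared(xj, others)))
    return out

# Hypothetical data: x3 is built to be nearly a linear combination of x1 and x2.
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(300)]
x2 = [rng.gauss(0, 1) for _ in range(300)]
x3 = [2 * a + 3 * b + rng.gauss(0, 0.5) for a, b in zip(x1, x2)]
x4 = [rng.gauss(0, 1) for _ in range(300)]  # unrelated regressor
result = vifs([x1, x2, x3, x4])
print([round(v, 1) for v in result])
```

In this made-up sample, the three entangled variables should show VIFs well above 10, while the unrelated regressor should sit close to 1.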


Correcting Multicollinearity 


The most common method to correct for multicollinearity is to omit one or more of 
the correlated independent variables. Unfortunately, it is not always an easy task to 
identify the variable(s) that are the source of the multicollinearity. There are statistical 
procedures that may help in this effort, like stepwise regression, which systematically
removes variables from the regression until multicollinearity is minimized.


MODULE QUIZ 20.1
1. Effects of conditional heteroskedasticity include which of the following problems?
I. The coefficient estimates in the regression model are biased.
II. The standard errors are unreliable.
A. I only.
B. II only.
C. Both I and II.
D. Neither I nor II.


2. Der-See Hsu, researcher for Xiang Li Quant Systems, is using a multiple regression 
model to forecast currency values. Hsu determines that the chi-squared statistic
calculated using the R² of the regression involving the squared residuals as
dependent variable exceeds the chi-squared critical value. Which of the following is 
the most appropriate conclusion for Hsu to reach? 

A. Hsu should estimate the White standard errors for use in hypothesis testing. 
B. OLS estimates and standard errors are consistent, unbiased, and reliable. 

C. OLS coefficients are biased but standard errors are reliable. 

D. A linear model is inappropriate to model the variation in the dependent variable. 


3. Ben Strong recently joined Equity Partners as a junior analyst. Within a few weeks, 
Strong successfully modeled the movement of price for a hot stock using a multiple 
regression model. Beth Sinclair, Strong's supervisor, is in charge of evaluating the 
results of Strong's model. What is the most appropriate conclusion for Sinclair based 
on the variance inflation factor (VIF) for each of the explanatory variables
included in Strong's model as shown here?


Variable VIF
X1 2.1
X2 10.3
X3 6.9


A. Variables X1 and X2 are highly correlated and should be combined into one
variable.
B. Variable X3 should be dropped from the model.
C. Variable X2 should be dropped from the model.
D. Variables X1 and X2 are not statistically significant.


4. Which of the following statements regarding multicollinearity is least accurate? 


A. Multicollinearity may be present in any regression model. 

B. Multicollinearity is not a violation of a regression assumption. 

C. Multicollinearity makes it difficult to determine the contribution to explanation 
of the dependent variable of an individual explanatory variable. 

D. If the t-statistics for the individual independent variables are insignificant, yet
the F-statistic is significant, this indicates the presence of multicollinearity. 


MODULE 20.2: MODEL SPECIFICATION 


LO 20.d: Describe the consequences of excluding a relevant explanatory variable 
from a model and contrast those with the consequences of including an 
irrelevant regressor. 


PROFESSOR'S NOTE 
The term regressor is often used to describe an independent (or X) variable.


Model specification is an art requiring a thorough understanding of the underlying 
economic theory that explains the behavior of the dependent variable. For example, 
many factors may influence short-term interest rates, including inflation rate, 
unemployment rate, GDP growth rate, capacity utilization, and so forth. Analysts trying
to model a variable need to determine the factors that should be included/excluded in
their model.


While including irrelevant/extraneous variables does not pose any serious challenges,
the model’s adjusted R² declines (recall that unless a variable contributes meaningfully
to explaining the variation in Y, its inclusion reduces the adjusted R²).


Omitting relevant factors from an ordinary least squares (OLS) regression can produce 
misleading or biased results. Omitted variable bias is present when two conditions 
are met: (1) the omitted variable is correlated with other independent variables in the 
model, and (2) the omitted variable is a determinant of the dependent variable. When 
relevant variables are absent from a linear regression model, the results will likely lead 
to incorrect conclusions, as the OLS estimators may not accurately portray the actual 
data. 


The coefficients of the included variables that are correlated with the omitted variable 
will partly (depending on the correlation between them) pick up the impact of the 
omitted variable (leading to biased estimates of coefficients of those variables). 
Furthermore, the uncorrelated portion of the omitted variable’s influence on the 
dependent variable gets captured by the error, magnifying it. 


The issue of omitted variable bias occurs regardless of the size of the sample and will 
make OLS estimators inconsistent. The correlation between the omitted variable and 
the included independent variables will determine the size of the bias (i.e., a larger 
correlation will lead to a larger bias) and the direction of the bias (i.e., whether the 
correlation is positive or negative). The coefficients of the included independent 
variables therefore would be biased and inconsistent. 
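

A small simulation can make the direction and size of the bias concrete. The sketch below is an illustration only: the data-generating process (true coefficients 2 on x1 and 3 on x2, with x2 built to be correlated with x1) and the helper `slope` are assumptions, not taken from the reading. It relies on the standard result that, when x2 is omitted, the slope on x1 picks up the true coefficient plus the x2 coefficient times the slope from regressing x2 on x1.

```python
import random

def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(42)
n = 20000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + rng.gauss(0, 0.6) for a in x1]   # x2 is correlated with x1
y = [1.0 + 2.0 * a + 3.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

short_slope = slope(x1, y)      # model that omits x2
delta = slope(x1, x2)           # how strongly x2 moves with x1
implied = 2.0 + 3.0 * delta     # true coefficient plus the bias term

print(round(short_slope, 3), round(implied, 3))
```

Because the sample is large, the estimated slope stays near 4.4 rather than the true value of 2, which is consistent with the bias not disappearing as the sample size grows.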


Bias-Variance Tradeoff 


LO 20.e: Explain two model selection procedures and how these relate to the 
bias-variance trade-off. 


The holy grail of model specification is selecting the appropriate explanatory variables 
to include in the model. Models with too many explanatory variables (i.e., overfit 
models) may explain the variation in dependent variable well in-sample, but perform 
poorly out-of-sample. Overfit, larger models have lower bias and higher variance (i.e., 
estimation) errors due to inclusion of too many independent variables. Smaller, less 
complex models, on the other hand, have higher bias and lower variance errors (i.e., 
lower R²). There are two ways to deal with this bias-variance tradeoff:


1. General-to-specific model: involves starting with the largest model and then
successively dropping independent variables that have the smallest absolute
t-statistic.


2. m-fold cross-validation: involves dividing the sample into m parts and then using
(m-1) parts (known as the training set) to fit the model and the remaining part
(known as the validation set) to use for out-of-sample validation. A set of candidate
models is first determined and then tested using this procedure to find the optimal
model, which is the one with the lowest out-of-sample error.
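

The m-fold procedure can be sketched in a few lines of Python. Everything named here (`solve`, `fit_ols`, `cv_mse`, the simulated data, and the three candidate models) is a hypothetical setup chosen for illustration. Each candidate is fit on m-1 folds, scored on the held-out fold, and the average out-of-sample squared error is compared across candidates.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """OLS coefficients via the normal equations; each row already includes the intercept."""
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    return solve(xtx, xty)

def cv_mse(rows, y, m=5, seed=0):
    """Average out-of-sample mean squared error across m folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::m] for i in range(m)]
    fold_errors = []
    for val in folds:
        held_out = set(val)
        train = [i for i in idx if i not in held_out]
        beta = fit_ols([rows[i] for i in train], [y[i] for i in train])
        sq = [(y[i] - sum(b * v for b, v in zip(beta, rows[i]))) ** 2 for i in val]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / m

# Hypothetical sample: six candidate regressors, only the first two matter.
rng = random.Random(1)
n = 200
data = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(n)]
y = [1.0 + 2.0 * r[0] - 1.0 * r[1] + rng.gauss(0, 1) for r in data]

candidates = {"X1 only": 1, "X1 and X2": 2, "all six X": 6}
scores = {name: cv_mse([[1.0] + r[:k] for r in data], y) for name, k in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

The underfit model ("X1 only") should score clearly worse out-of-sample than the model with both relevant regressors; how the six-variable model compares depends on the sample drawn.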


Residual Plots 


LO 20.f: Describe the various methods of visualizing residuals and their relative 
strengths. 


Basic residual plots show the residuals on the y-axis and the predicted value of the
dependent variable ($\hat{y}$) on the x-axis. Ideally, the residuals should be small in magnitude,
and not related to any of the explanatory variables. Alternatively, standardized
residuals (i.e., the residuals divided by their standard deviation) could be plotted on the
y-axis. The magnitude of the residuals would then be standardized and any residual
over ±4 standard deviations would be considered problematic.


Identifying Outliers 


LO 20.g: Describe methods for identifying outliers and their impact. 


Recall that one of the assumptions of linear regression is that there are no outliers in 
the sample data. This is because the presence of outliers skews the estimated regression 
parameters. Outliers, when removed, induce large changes in the value of the estimated 
coefficients. One metric to identify an outlier is Cook’s distance, which is computed as 
follows:


$$D_j = \frac{\sum_{i=1}^{n} \left(\hat{y}_i^{(-j)} - \hat{y}_i\right)^2}{k s^2}$$


where:
$\hat{y}_i^{(-j)}$ = predicted value of y after dropping outlier observation j
$\hat{y}_i$ = predicted value of y without dropping any observation
k = number of independent variables
$s^2$ = squared residuals in the model with all observations


Large values of Cook’s distance (i.e., $D_j$ > 1) indicate that the dropped observation was
indeed an outlier.
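

As a rough illustration of the leave-one-out logic behind Cook's distance, the sketch below refits a single-regressor model (k = 1) with each observation dropped in turn. The function names and the ten data points are invented for the example, and the residual variance is estimated as the sum of squared residuals divided by (n - k - 1), which is an assumption about how $s^2$ is computed.

```python
def fit_line(x, y):
    """Intercept and slope from a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

def cooks_distances(x, y):
    """Cook's distance for every observation in a one-regressor model (k = 1)."""
    n, k = len(x), 1
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    # Assumption: s^2 is the residual variance estimate SSR / (n - k - 1).
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - k - 1)
    out = []
    for j in range(n):
        # Refit without observation j, then predict all n observations.
        aj, bj = fit_line(x[:j] + x[j + 1:], y[:j] + y[j + 1:])
        fitted_j = [aj + bj * xi for xi in x]
        out.append(sum((fj - f) ** 2 for fj, f in zip(fitted_j, fitted)) / (k * s2))
    return out

# Hypothetical data: the first nine points lie near y = 2x; the last one does not.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 10.0]
d = cooks_distances(x, y)
print([round(v, 2) for v in d])
```

Only the last observation should produce a distance above 1, because dropping it changes the fitted line substantially.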


The Best Linear Unbiased Estimator 


LO 20.h: Determine the conditions under which OLS is the best linear unbiased 
estimator. 


For OLS to generate the best linear unbiased estimator (BLUE), the assumptions
underlying the linear regression need to be satisfied. Specifically, the relationship
between Y and X(s) should be linear and residuals should be homoskedastic (i.e.,
residual distribution should be identical), independent, and have an expected value of
zero. The last few assumptions are summarized as $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$.


If there are no outliers, and the residuals have an expected value of zero, we can relax 
the assumption of normality for the residual distribution. 


MODULE QUIZ 20.2
1. The omitted variable bias results from: 
A. exclusion of uncorrelated independent variables. 
B. inclusion of uncorrelated independent variables. 
C. inclusion of correlated independent variables. 
D. exclusion of correlated independent variables. 


2. Which of the following statements about bias-variance tradeoff is most accurate? 
A. Models with a large number of independent variables tend to have a high bias 
error. 
B. High variance error results when the out-of-sample R² of a regression is high.
C. Models with fewer independent variables tend to have a high variance error. 
D. General-to-specific model is one approach to resolve the bias-variance tradeoff. 


3. Evaluate the following statements: 
I. A high value of Cook's distance indicates the presence of an outlier. 
II. Cook's distance is inversely related to the squared residuals. 
A. Both statements are correct. 
B. Only Statement I is correct. 
C. Only Statement II is correct.
D. Both statements are incorrect. 


KEY CONCEPTS 


LO 20.a 


Conditional heteroskedasticity indicates that the variance of the residual term is 
conditioned on the value of the independent variable. Even though the coefficient 
estimates are unbiased and consistent, the estimated standard errors are unreliable in 
the presence of conditional heteroskedasticity. The results of any hypothesis testing are 
therefore unreliable. 


LO 20.b 


If conditional heteroskedasticity is detected, we can conclude that the coefficients are 
unaffected but the standard errors are unreliable. In such a case, revised, White 
estimated standard errors should be used in hypothesis testing instead of the standard 
errors from OLS procedures. 


LO 20.c 


When the X variables are perfectly correlated, it is called perfect collinearity. 
Multicollinearity refers to when two or more of the independent variables, or linear 
combinations of the independent variables, in a multiple regression are highly 
correlated with each other. As a result of multicollinearity, there is a greater
probability that we will incorrectly conclude that a variable is not statistically
significant (i.e., a Type II error). One of the clues for presence of multicollinearity is
when there is a disconnect between t-tests for significance of individual slope
coefficients and the F-test for the overall model. Alternatively, the variance inflation
factor (VIF) for each explanatory variable can be calculated to indicate the presence of
multicollinearity; a VIF > 10 for a variable indicates the presence of multicollinearity.


LO 20.d 
While including irrelevant/extraneous variables does not pose any serious challenges, 
the model’s adjusted R² declines.


Omitting relevant factors from an ordinary least squares (OLS) regression can produce 
misleading or biased results. Omitted variable bias is present when two conditions are 
met: (1) the omitted variable is correlated with other independent variables in the 
model, and (2) the omitted variable is a determinant of the dependent variable. 


LO 20.e 

Bias-variance tradeoff involves selecting between overfit models with too many 
variables and higher complexity (i.e. high variance, but low bias) versus models with 
fewer explanatory variables and lower complexity (i.e. high bias, but low variance). 


LO 20.f 


Two methods of plotting residuals versus predicted y-values include raw residuals and
standardized residuals. The benefit of using standardized residuals is that outliers can
be quickly visualized when their values exceed ±4.


LO 20.g 
Apart from residual plots, outliers can be identified via Cook’s distance as follows:


$$D_j = \frac{\sum_{i=1}^{n} \left(\hat{y}_i^{(-j)} - \hat{y}_i\right)^2}{k s^2}$$


LO 20.h

OLS generates best linear unbiased estimates (BLUE) when the residual variance is 
constant and has an expected value of zero (even if the distribution of residuals is not 
normal). 


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 20.1 


1.B Effects of heteroskedasticity include the following: (1) The standard errors are 
usually unreliable estimates and (2) the coefficient estimates are not affected. (LO 
20.a) 


2.A Hsu’s test results indicate that the null hypothesis of no conditional 
heteroskedasticity should be rejected. In such a case, the OLS estimates of 
standard errors would be unreliable and Hsu should estimate White corrected 
standard errors for use in hypothesis testing. Coefficient estimates would still be 
reliable (i.e., unbiased and consistent). (LO 20.b) 


3.C VIF > 10 for independent variable X2 indicates that it is highly correlated with the 
other two independent variables in the model, indicating multicollinearity. One of 
the approaches to overcoming the problem of multicollinearity is to drop the 
highly correlated variable. (LO 20.c) 


4.A Multicollinearity will not be present in a single regression. While perfect 
collinearity is a violation of a regression assumption, the presence of 
multicollinearity is not. Divergence between t-test and F-test is one way to detect 
the presence of multicollinearity. Multicollinearity makes it difficult to precisely 
measure the contribution of an independent variable toward explaining the 
variation in the dependent variable. (LO 20.c) 


Module Quiz 20.2 


1.D Omitted variable bias results from excluding a relevant independent variable that
is correlated with other independent variables. (LO 20.d)


2.D Larger, overfit models have a low bias error (high R² in-sample but low R²
out-of-sample). Smaller, parsimonious models have lower R² in-sample and a lower
variance error. Two ways to resolve the bias-variance tradeoff are the
general-to-specific model and m-fold cross-validation. (LO 20.e)


3.A Both statements are correct. A high value of Cook’s distance for an observation (> 
1) indicates that it is an outlier. The squared residuals are in the denominator in 
the computation of Cook’s distance and, hence, are inversely related to the 
measure. (LO 20.g) 


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 10. 


READING 21 
STATIONARY TIME SERIES 


Study Session 7 


EXAM FOCUS 


In this reading, we learn to model the cyclical component of a time series using 
autoregressive (AR), moving average (MA), and autoregressive moving average (ARMA) 
processes. For the past values of a time series to serve as a guide for its future values, it 
is necessary that the time series is stationary (i.e., past patterns are expected to 
continue). For the exam, know the difference between an AR process and an MA process 
and how some series can be modeled best with a combination of the two. Finally, 
understand the model evaluation using residual autocorrelations. 


MODULE 21.1: COVARIANCE STATIONARY 


LO 21.a: Describe the requirements for a series to be covariance stationary. 


A time series is data collected over regular time periods (e.g., monthly S&P 500 
returns, quarterly dividends paid by a company, etc.). Time series data have trends (the 
component that changes over time), seasonality (systematic changes that occur at
specific times of the year), and cyclicality (changes occurring over time cycles). For 
this reading, we are concerned with the third component. This cyclical component can 
be decomposed into shocks and persistence components. While we discuss the seasonal 
component briefly at the end of the reading, for the most part, we will limit ourselves 
to linear models to model the persistence component. 


A process such as a time series must have certain properties if we want to forecast its 
future values based on its past values. In particular, it needs the relationships among its 
present and past values to remain stable over time. We refer to such a time series as 
being covariance stationary. 

To be covariance stationary, a time series must exhibit the following three properties: 
1. Its mean must be stable over time. 

2. Its variance must be finite and stable over time. 


3. Its covariance structure must be stable over time. 


Covariance structure refers to the covariances among the values of a time series at its
various lags, which are a given number of periods apart at which we can observe its
values. We use the lowercase Greek letter tau, $\tau$, to represent a lag. For example, $\tau$ = 1
refers to a one-period lag, comparing each value of a time series to its preceding value,
and if $\tau$ = 4 we are comparing values four periods apart along the time series.


Autocovariance and Autocorrelation Functions 


LO 21.b: Define the autocovariance function and the autocorrelation function. 


The covariance between the current value of a time series and its value $\tau$ periods in the
past is referred to as its autocovariance at lag $\tau$. Its autocovariances for all $\tau$ make up
its autocovariance function. If a time series is covariance stationary, its
autocovariance function is stable over time. That is, its autocovariance depends on the
$\tau$ we choose, but does not depend on the time over which we observe the series.


As we often do when working with covariances, we can convert them to correlations to 
better interpret the strength of the relationships. To convert an autocovariance 
function to an autocorrelation function (ACF), we divide the autocovariance at each $\tau$
by the variance of the time series. This gives us an autocorrelation for each $\tau$ that will
be scaled between -1 and +1.


A useful way to analyze an ACF is to display it on a graph. Figure 21.1 illustrates an
example of an ACF. As can be seen in the graph, the autocorrelations approach zero as $\tau$
gets large. This is always the case for a covariance stationary series.
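

To see how an ACF is built from autocovariances, consider the following Python sketch. The function `sample_acf` and the simulated persistent series are assumptions for illustration only. Each autocovariance is divided by the lag-0 autocovariance (the variance), so the lag-0 autocorrelation is 1 by construction.

```python
import random

def sample_acf(y, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    gamma0 = sum(d * d for d in dev) / n            # autocovariance at lag 0 (variance)
    acf = []
    for tau in range(max_lag + 1):
        gamma_tau = sum(dev[t] * dev[t - tau] for t in range(tau, n)) / n
        acf.append(gamma_tau / gamma0)              # scale to lie between -1 and +1
    return acf

# Hypothetical persistent series: each value carries over 80% of the previous one.
rng = random.Random(7)
y = [0.0]
for _ in range(4999):
    y.append(0.8 * y[-1] + rng.gauss(0, 1))

acf = sample_acf(y, 10)
print([round(r, 2) for r in acf])
```

For this simulated series the printed autocorrelations should shrink toward zero as the lag grows, mirroring the shape described for Figure 21.1.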


Figure 21.1: Autocorrelation Function


A related function is the partial autocorrelation function, which makes up the
correlations for all lags after controlling for the values between the lags (think about
coefficient values in a regression when including all the lags as independent variables).


PROFESSOR'S NOTE
These are partial in the sense that they are regressed one lag at a time. For
example, if we regress a monthly time series against its year-ago values, we
get a partial autocorrelation for $\tau$ = 12 that does not account for any effects
from other lags. We would be unlikely to get the same result for $\tau$ = 12 if we
ran a multiple regression that also included $\tau$ = 1, $\tau$ = 2, and so forth.


While autocorrelations successively decline, it is not so for partial autocorrelations; 
partial autocorrelations experience a steep decline. Partial autocorrelations may be 
large only for a few lags and those lags become prime candidates for inclusion in an 
autoregressive (AR) model, discussed later. 


White Noise 


LO 21.c: Define white noise, and describe independent white noise and normal 
(Gaussian) white noise. 


A time series might exhibit zero correlation among any of its lagged values. Such a time
series is said to be serially uncorrelated. A special type of serially uncorrelated series
is one that has a mean of zero and a constant variance. This condition is referred to as
white noise, or zero-mean white noise, and the time series is said to follow a white
noise process.


If the observations in a white noise process are independent, as well as uncorrelated, 
the process is referred to as independent white noise. If the process also follows a 
normal distribution, it is known as normal white noise or Gaussian white noise. Not 
all independent white noise processes are normally distributed, but all normal white 
noise processes are also independent white noise. 


Graphically, a white noise process resembles Figure 21.2, with no identifiable patterns 
among the time periods. 


Figure 21.2: White Noise Process


One important purpose of the white noise concept is to analyze a forecasting model. A
model’s forecast errors should follow a white noise process. If they do not, the errors 
themselves can be forecasted based on their past values. This implies that the model is 
inaccurate in a predictable way and is therefore inadequate; it needs to be revised, 
perhaps by adding more lags. 


Earlier, we stated that a white noise process has a mean of zero and a constant 
variance; this refers to its unconditional mean and variance. A process may have a 
conditional mean and variance that are not necessarily constant. That is, the expected 
value of the next observation in the series might not be the mean of the time series if 
the next observation is conditional on one or more of its earlier values. If such a 
relationship exists, we can use it for forecasting the time series. 


For an independent white noise process, we can say the next value in the series has no 
conditional relationship to any of its past values. Therefore, its conditional mean is the 
same as its unconditional mean. In this case, we cannot forecast based on past values. 


Wold’s theorem proposes a way to model the role of white noise and holds that a 
covariance stationary process can be modeled as an infinite distributed lag of a white 
noise process. Such a model would take the following form:


$$y_t = \epsilon_t + b_1 \epsilon_{t-1} + b_2 \epsilon_{t-2} + \ldots = \sum_{i=0}^{\infty} b_i \epsilon_{t-i}$$


where the b variables are constants (with $b_0 = 1$) and $\epsilon_t$ is a white noise process.


Because this expression can be applied to any covariance stationary series, it is known 
as a general linear process. 


MODULE QUIZ 21.1
1. The conditions for a time series to exhibit covariance stationarity are least likely to
include: 
A. a stable mean. 
B. a finite variance. 
C. a finite number of observations. 
D. autocovariances that do not depend on time. 


2. As the number of lags or displacements becomes large, autocorrelation functions 
(ACFs) will approach: 
A. -1. 
B. 0.
C. 0.5. 
D. +1. 


3. Which of the following statements about white noise is most accurate? 


A. All serially uncorrelated processes are white noise. 
B. All Gaussian white noise processes are independent white noise. 
C. All independent white noise processes are Gaussian white noise. 


D. All serially correlated Gaussian processes are independent white noise. 


MODULE 21.2: AUTOREGRESSIVE AND MOVING 
AVERAGE MODELS 


Autoregressive Processes 
LO 21.d: Define and describe the properties of autoregressive (AR) processes. 


LO 21.g: Explain mean reversion and calculate a mean-reverting level. 


LO 21.m: Describe the role of mean reversion in long-horizon forecasts. 


Autoregressive models are the most widely applied time series models in finance. The 
first-order autoregressive [AR(1)] process is specified in the form of a variable 
regressed against itself in lagged form. This relationship can be shown in the following 
formula: 


$$y_t = d + \Phi y_{t-1} + \epsilon_t$$


where:
d = intercept term
$y_t$ = time series variable being estimated
$y_{t-1}$ = one-period lagged observation of the variable being estimated
$\epsilon_t$ = current random white noise shock (mean 0)
$\Phi$ = coefficient for the lagged observation of the variable being estimated


In order for an AR(1) process to be covariance stationary, the absolute value of the
coefficient on the lagged observation must be less than one (i.e., $|\Phi| < 1$). Similarly, for an
AR(p) process, the sum of all coefficients should be less than 1.


The long-run (or unconditional) mean reverting level of an AR(1) series is:


$$\mu = \frac{d}{1 - \Phi}$$


The long-run (or unconditional) mean reverting level of an AR(p) series is:


$$\mu = \frac{d}{1 - \Phi_1 - \Phi_2 - \ldots - \Phi_p}$$

This mean reverting level acts as an attractor such that the time series moves toward 
its mean over time. 


For an AR(1) process, the variance of $y_t$ is:


$$\text{Var}(y_t) = \frac{\sigma_\epsilon^2}{1 - \Phi^2}$$


Similarly, we can calculate the variance of an AR(p) process by subtracting all the
squared coefficients in the denominator.
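

The two AR(1) formulas are easy to apply numerically. The following sketch uses hypothetical inputs (d = 2, a lag coefficient of 0.6, and a shock variance of 1), and the function names are invented for illustration.

```python
def ar1_long_run_mean(d, phi):
    """Mean-reverting level d / (1 - phi) of a covariance stationary AR(1)."""
    if abs(phi) >= 1:
        raise ValueError("an AR(1) is covariance stationary only if |phi| < 1")
    return d / (1.0 - phi)

def ar1_variance(sigma2_eps, phi):
    """Unconditional variance sigma_eps^2 / (1 - phi^2) of a covariance stationary AR(1)."""
    if abs(phi) >= 1:
        raise ValueError("an AR(1) is covariance stationary only if |phi| < 1")
    return sigma2_eps / (1.0 - phi ** 2)

# Hypothetical parameters chosen for illustration.
mean_level = ar1_long_run_mean(d=2.0, phi=0.6)     # 2 / (1 - 0.6)
variance = ar1_variance(sigma2_eps=1.0, phi=0.6)   # 1 / (1 - 0.36)
print(mean_level, variance)
```

With these inputs the mean-reverting level works out to 2 / 0.4 = 5 and the variance to 1 / 0.64 = 1.5625. The guard against a coefficient of 1 or more in absolute value reflects the stationarity condition stated above.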


For example, if we are modeling daily demand for ice cream, we would forecast our
current period daily demand ($y_t$) as a function of a coefficient ($\Phi$) multiplied by our
lagged daily demand for ice cream ($y_{t-1}$) and then add a random error shock ($\epsilon_t$). This
process enables us to use a past observed variable to predict a current observed
variable.


To estimate the autoregressive parameters, such as the coefficient ($\Phi$), forecasters need
to accurately estimate the autocovariance function of the data series:


$$\gamma_\tau = \Phi^{|\tau|} \gamma_0$$


The Yule-Walker equation is used for this purpose. When using the Yule-Walker
concept to solve for the autocorrelations of an AR(1) process, we use the following
relationship:


$$\rho_\tau = \Phi^{|\tau|} \text{ for } \tau = 0, 1, 2, \ldots$$


The significance of the Yule-Walker equation is that for autoregressive processes, the
autocorrelation decays geometrically to zero as $\tau$ increases.


Consider an AR(1) process that is specified using the following formula: 


$$y_t = 0.65 y_{t-1} + \epsilon_t$$


The coefficient ($\Phi$) is equal to 0.65; the first-period autocorrelation is 0.65 (i.e., $0.65^1$);
the second-period autocorrelation is 0.4225 (i.e., $0.65^2$); and so forth for the remaining
autocorrelations.


It should also be noted that if the coefficient ($\Phi$) were to be a negative number, perhaps
-0.65, then the decay would still occur, but the value would oscillate between negative
and positive numbers. This is true because $(-0.65)^3 = -0.2746$, $(-0.65)^4 = 0.1785$, and
$(-0.65)^5 = -0.1160$. You would still notice the absolute value decaying, but the actual
autocorrelations would alternate between positive and negative numbers over time.
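

The geometric decay, and the oscillation when the coefficient is negative, can be reproduced with a few lines of Python. The function name below is an invented helper; it simply raises the coefficient to the power of the lag, as in the Yule-Walker relationship for an AR(1).

```python
def ar1_autocorrelations(phi, max_lag):
    """Autocorrelations of an AR(1) for lags 0..max_lag: rho_tau = phi ** |tau|."""
    return [phi ** abs(tau) for tau in range(max_lag + 1)]

positive = ar1_autocorrelations(0.65, 5)
negative = ar1_autocorrelations(-0.65, 5)

for tau in range(6):
    print(tau, round(positive[tau], 4), round(negative[tau], 4))
```

The positive-coefficient column declines steadily (0.65, 0.4225, and so on), while the negative-coefficient column has the same absolute values with alternating signs.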


Moving Average (MA) Processes 


LO 21.e: Define and describe the properties of moving average (MA) processes. 


Conceptually, an MA process is a linear regression of the current values of a time series 
against both the current and previous unobserved white noise error terms, which are 
random shocks. MAs are always covariance stationary. The first-order moving 
average [MA(1)] process can be defined as: 
$$y_t = \mu + \theta \epsilon_{t-1} + \epsilon_t$$
where:
$\mu$ = mean of the time series
$\epsilon_t$ = current random white noise shock (mean 0)
$\epsilon_{t-1}$ = one-period lagged random white noise shock
$\theta$ = coefficient for the lagged random shock


The MA(1) process is considered to be first-order because it only has one lagged error
term ($\epsilon_{t-1}$). This yields a very short-term memory because it only incorporates what
happened one period ago. If we ignore the lagged error term for a moment and assume
that $\epsilon_t > 0$, then $y_t > 0$. This is equivalent to saying that a positive error term will yield a
positive dependent variable ($y_t$). When adding back the lagged error term, we are now
saying that the dependent variable is impacted by not only the current error term, but
also the previous period’s unobserved error term, which is amplified by a coefficient
($\theta$). Consider an example using daily demand for ice cream ($y_t$) to better understand
how this works:


$$y_t = 5{,}000 + 0.3 \epsilon_{t-1} + \epsilon_t$$


In this equation, the error term is the daily change in demand. Using only the current
period’s error term ($\epsilon_t$), if the daily change is positive, then we would estimate that
daily demand for ice cream would also be positive. But, if the daily change yesterday
($\epsilon_{t-1}$) was also positive, then we would expect an amplified impact on our daily demand
by a factor of 0.3. If the coefficient $\theta$ is negative, the series aggressively mean reverts
because the effect of the previous shock reverts in the current period.


One key feature of MA processes is called the autocorrelation ($\rho$) cutoff. We would
compute the autocorrelation using the following formula:


$$\rho_1 = \frac{\theta_1}{1 + \theta_1^2}; \text{ where } \rho_\tau = 0 \text{ for } \tau > 1$$


Using the previous example with $\theta$ = 0.3, we would compute the autocorrelation to be
0.2752 as follows:


$$\rho_1 = \frac{0.3}{1 + 0.3^2} = 0.2752$$


For any value beyond the first lagged error term, the autocorrelation will be zero in an
MA(1) process. This is important because it is one condition of being covariance
stationary (i.e., mean = 0, variance = $\sigma^2$), which is a condition of this process being a
useful estimator.
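

The cutoff can also be checked by simulation. The sketch below is an illustration under stated assumptions: the shocks are drawn as standard normal white noise, and the helper names `ma1_rho` and `sample_autocorr` are invented. It compares the theoretical MA(1) autocorrelations with sample values from a simulated version of the ice cream example.

```python
import random

def ma1_rho(theta, tau):
    """Theoretical MA(1) autocorrelation: theta / (1 + theta^2) at lag 1, zero beyond."""
    tau = abs(tau)
    if tau == 0:
        return 1.0
    if tau == 1:
        return theta / (1.0 + theta ** 2)
    return 0.0

def sample_autocorr(series, tau):
    """Sample autocorrelation of a series at lag tau (tau >= 1)."""
    mean = sum(series) / len(series)
    dev = [v - mean for v in series]
    return sum(dev[t] * dev[t - tau] for t in range(tau, len(dev))) / sum(d * d for d in dev)

# Simulate y_t = 5,000 + 0.3 * eps_{t-1} + eps_t with assumed N(0, 1) shocks.
rng = random.Random(11)
eps = [rng.gauss(0, 1) for _ in range(50001)]
y = [5000 + 0.3 * eps[t - 1] + eps[t] for t in range(1, len(eps))]

print(round(ma1_rho(0.3, 1), 4), round(sample_autocorr(y, 1), 4))
print(ma1_rho(0.3, 2), round(sample_autocorr(y, 2), 4))
```

The lag-1 sample value should land close to 0.2752, and the lag-2 sample value should be close to zero, apart from sampling noise.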


It is also important to note that this moving average representation has both a
current random shock ($\epsilon_t$) and a lagged unobservable shock ($\epsilon_{t-1}$) on the independent
side of this equation. This presents a problem for forecasting in the real world because
it does not incorporate observable shocks.

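A more general form of a moving average process, MA(q), incorporates q lags:


$$y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \ldots + \theta_q \epsilon_{t-q}$$


The mean of the MA(q) process is still $\mu$ but the variance will change to:


$$\text{Var}(y_t) = \sigma_\epsilon^2 (1 + \theta_1^2 + \theta_2^2 + \ldots + \theta_q^2)$$


Lag Operators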


LO 21.f: Explain how a lag operator works. 


A commonly used notation for time series modeling is the lag operator (L). If $y_t$ is the
value of a time series at time t, and $y_{t-1}$ is its value one period earlier, we can express a
lag operator as:


$$L y_t = y_{t-1}$$


There are six properties of the lag operator:

1. It shifts the time index back by one period.

2. To apply the lag operator over multiple periods, $L^m y_t = y_{t-m}$.

3. When applied to a constant, the lag operator does not change the constant.

4. Forecasting models often take the form of a distributed lag that assigns weights to
the past values of a time series. For example, suppose we have the following model:

$$y_t + 0.7 y_{t-1} + 0.4 y_{t-2} + 0.2 y_{t-3}$$

Using lag operators in this model (known as a lag polynomial), it would be expressed
as:

$$(1 + 0.7L + 0.4L^2 + 0.2L^3) y_t$$

5. Lag polynomials can be multiplied.

6. Assuming that the coefficients satisfy some conditions, the polynomial can be
inverted.
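

The lag polynomial from property 4 can be applied to a short series to see what the notation means in practice. In the sketch below, the helper names `lag` and `apply_lag_polynomial` and the five-point series are assumptions for illustration. Positions where a required lagged value does not exist are returned as `None`.

```python
def lag(y, m=1):
    """Apply L^m: position t holds y[t-m]; positions without a lagged value hold None."""
    if m == 0:
        return list(y)
    keep = max(len(y) - m, 0)
    return [None] * (len(y) - keep) + list(y[:keep])

def apply_lag_polynomial(coeffs, y):
    """Apply (c0 + c1*L + c2*L^2 + ...) to y, where coeffs[m] is the weight on L^m."""
    lagged = [lag(y, m) for m in range(len(coeffs))]
    out = []
    for t in range(len(y)):
        terms = [series[t] for series in lagged]
        out.append(None if None in terms else sum(c * v for c, v in zip(coeffs, terms)))
    return out

# Hypothetical series; the polynomial is (1 + 0.7L + 0.4L^2 + 0.2L^3).
y = [10.0, 20.0, 30.0, 40.0, 50.0]
result = apply_lag_polynomial([1.0, 0.7, 0.4, 0.2], y)
print(result)
```

For the fourth observation, the weighted sum is 40 + 0.7(30) + 0.4(20) + 0.2(10) = 71; the first three positions have no value because three lags are not yet available.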


There are two main purposes of using a lag operator. First, an AR process is covariance 
stationary only if its lag polynomial is invertible. Second, this invertibility is used to 
select the appropriate time series model among equivalent models by applying what is 
known as the Box-Jenkins methodology. 


MODULE QUIZ 21.2
1. Which of the following conditions is necessary for an autoregressive (AR) process to
be covariance stationary? 
A. The value of the lag slope coefficients should add to 1. 
B. The value of the lag slope coefficients should all be less than 1. 
C. The absolute value of the lag slope coefficients should be less than 1. 
D. The sum of the lag slope coefficients should be less than 1. 


2. Which of the following statements is a key differentiator between a moving average 
(MA) representation and an autoregressive (AR) process? 
A. An MA representation shows evidence of autocorrelation cutoff. 
B. An AR process shows evidence of autocorrelation cutoff. 
C. An unadjusted MA process shows evidence of gradual autocorrelation decay. 
D. An AR process is never covariance stationary. 


3. Assume in an autoregressive [AR(1)] process that the coefficient for the lagged 
observation of the variable being estimated is equal to 0.75. According to the Yule- 
Walker equation, what is the second-period autocorrelation? 

A. 0.375. 
B. 0.5625. 
C. 0.75. 
D. 0.866. 


4. Which of the following statements is most likely a purpose of the lag operator?
A. A lag operator ensures that the parameter estimates are consistent.
B. An autoregressive (AR) process is covariance stationary only if its lag polynomial
is invertible.
C. Lag polynomials can be multiplied.
D. A lag operator ensures that the parameter estimates are unbiased.


MODULE 21.3: AUTOREGRESSIVE MOVING AVERAGE 
(ARMA) MODELS 


LO 21.h: Define and describe the properties of autoregressive moving average 
(ARMA) processes. 


So far, we have examined MA processes and AR processes assuming they interact 
independently of each other. While this may be the case, it is possible for a time series 
to show signs of both processes and theoretically capture a still richer relationship. For 
example, stock prices might show evidence of being influenced by both unobserved 
shocks (the MA component) and their own lagged behavior (the autoregressive 
component). This more complex relationship is called an autoregressive moving 
average (ARMA) process and is expressed by the following formula: 


$$y_t = d + \Phi y_{t-1} + \epsilon_t + \theta \epsilon_{t-1}$$

where:
d = intercept term
$y_t$ = time series variable being estimated
$\Phi$ = coefficient for the lagged observations of the variable being estimated
$y_{t-1}$ = one-period lagged observation of the variable being estimated
$\epsilon_t$ = current random white noise shock
$\theta$ = coefficient for the lagged random shocks
$\epsilon_{t-1}$ = one-period lagged random white noise shock


You can see that the ARMA specification merges the concepts of an AR process and an 
MA process. In order for the ARMA process to be covariance stationary, which is 
important for forecasting, we must still observe that $|\Phi| < 1$. Just as with the AR
process, the autocorrelations in an ARMA process will also decay gradually for 
essentially the same reasons. 


Consider an example regarding sales of an item ($y_t$) and a random shock of advertising
($\epsilon_t$). We could attempt to forecast sales for this item as a function of the previous
period’s sales ($y_{t-1}$), the current level of advertising ($\epsilon_t$), and the one-period lagged level
of advertising ($\epsilon_{t-1}$). It makes intuitive sense that sales in the current period could be
affected by both past sales and by random shocks, such as advertising. Another possible
random shock for sales could be the seasonal effects of weather conditions.


PROFESSOR'S NOTE

Just as MA models can be extrapolated to the qth observation and AR models
can be taken out to the pth observation, ARMA models can be used in the
format of an ARMA(p,q) model. For example, an ARMA(3,1) model means
three lagged operators in the AR portion of the formula and one lagged
operator on the MA portion. This flexibility provides the highest possible set
of combinations for time series forecasting of the three models discussed in
this reading.


Application of AR, MA, and ARMA Processes 


LO 21.i: Describe the application of AR, MA, and ARMA processes. 


LO 21.l: Explain how forecasts are generated from ARMA models.


A forecaster might begin by plotting the autocorrelations for a data series and find that 
the autocorrelations cut off abruptly. In this case, the forecaster should consider using 
an MA process. If the autocorrelations instead decay gradually, he should consider 
using either an AR process or an ARMA process. The forecaster should especially 
consider these alternatives if he notices periodic spikes in the autocorrelations as they 
are gradually decaying. For example, if every 12th autocorrelation jumps upward, this
observation indicates a possible seasonality effect in the data and would heavily point 
toward using either an AR or ARMA model. 


Another way of looking at model applications is to test various models using regression 
results. It is easiest to see the differences using data that follows some pattern of 
seasonality, such as employment data. In the real world, a moving average process 
would not specify a very robust model, and autocorrelations would decay gradually, so 
forecasters would be wise to consider both AR models and ARMA models for 
employment data. 


We could begin with a base AR(2) model that adds in a constant value ($\mu$) if all other
values are zero. This is shown in the following generic formula:

$$y_t = \mu + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \epsilon_t$$

Applying actual coefficients, our real AR(2) model might look something like:

$$y_t = 101.2413 + 1.4388y_{t-1} - 0.4765y_{t-2} + \epsilon_t$$

We could also try to forecast our seasonally impacted employment data with an
ARMA(3,1) model, which might look like the following formula:

$$y_t = \mu + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \Phi_3 y_{t-3} + \theta \epsilon_{t-1} + \epsilon_t$$

Applying actual coefficients, our real ARMA(3,1) model might look something like:

$$y_t = 101.1378 + 0.5004y_{t-1} + 0.8722y_{t-2} - 0.4434y_{t-3} + 0.9709\epsilon_{t-1} + \epsilon_t$$


In practice, researchers would attempt to determine whether the AR(2) model or the 
ARMA(3,1) model provides a better prediction for the seasonally impacted data series. 


Suppose the researcher settles on the ARMA(3,1) model, and suppose the previous three
values of the time series are as follows: $y_{t-1}$ = 10.38, $y_{t-2}$ = 10.14, $y_{t-3}$ = 10.50, and suppose
that the previous shock $\epsilon_{t-1}$ = -1.23. The forecasted next period value of y would be
calculated as:

$$y_t = 101.1378 + (0.5004 \times 10.38) + (0.8722 \times 10.14) - (0.4434 \times 10.50) + (0.9709 \times -1.23) = 109.3262$$
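
The forecast arithmetic above can be checked with a short Python sketch. The function name is a hypothetical helper; the coefficients and lagged values are the ones given in the example, and the current-period shock is set to its expected value of zero.

```python
def arma_one_step_forecast(intercept, ar_coeffs, lagged_values, ma_coeffs, lagged_shocks):
    """One-step-ahead ARMA forecast: intercept + AR part + MA part.

    ar_coeffs[i] pairs with lagged_values[i] (most recent value first);
    ma_coeffs[i] pairs with lagged_shocks[i] (most recent shock first).
    """
    ar_part = sum(phi * y for phi, y in zip(ar_coeffs, lagged_values))
    ma_part = sum(theta * e for theta, e in zip(ma_coeffs, lagged_shocks))
    return intercept + ar_part + ma_part


forecast = arma_one_step_forecast(
    intercept=101.1378,
    ar_coeffs=[0.5004, 0.8722, -0.4434],
    lagged_values=[10.38, 10.14, 10.50],
    ma_coeffs=[0.9709],
    lagged_shocks=[-1.23],
)
print(round(forecast, 4))  # 109.3262
```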


Sample and Partial Autocorrelations 


LO 21.j: Describe sample autocorrelation and partial autocorrelation. 


Sample autocorrelation and partial autocorrelation are calculated as discussed 
previously, but using sample data. These are used to validate and improve ARMA 
models. Initially, these sample statistics guide the analyst in selecting an appropriate 
model that conforms to the sample data. 


Similarly, in evaluating the goodness of fit of a model, residual autocorrelations at 
different lags are computed. These are then tested for statistical significance. If the 
model fits the sample data well, none of the residual autocorrelations should be 
statistically significantly different from zero. In other words, we are trying to 
determine whether the model has captured all the information, or whether some 
information is still present in the residuals. We discuss formal tests for this in the next 
section. 


Testing Autocorrelations 


LO 21.k: Describe the Box-Pierce Q statistic and the Ljung-Box Q statistic. 


As stated before, model specification checks involve an examination of the residual ACF. We
want all residual autocorrelations to be zero. A graphical examination of these 
autocorrelations can provide insights; any autocorrelations violating the 95% 
confidence interval around zero would indicate that the model does not adequately 
capture the underlying patterns in the data. 


A joint test of determining that all residual autocorrelations equal zero versus at least 
one is not equal to zero is the Box-Pierce (BP) statistic: 


$$Q_{BP} = T \sum_{i=1}^{h} r_i^2$$

where:
$Q_{BP}$ = chi-squared statistic with h degrees of freedom
T = sample size
$r_i$ = sample autocorrelation at lag i


For smaller samples (T < 100), a version of the BP statistic known as the Ljung-Box 
(LB) statistic works better: 


$$Q_{LB} = T \sum_{i=1}^{h} \left(\frac{T+2}{T-i}\right) r_i^2$$
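
As a rough sketch, both Q statistics can be computed from a list of sample autocorrelations. The function names are assumptions made for illustration, and the Ljung-Box version uses the standard form in which each squared autocorrelation is weighted by (T + 2) / (T - i).

```python
def sample_autocorrelations(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denominator = sum(d * d for d in deviations)
    return [
        sum(deviations[t] * deviations[t - lag] for t in range(lag, n)) / denominator
        for lag in range(1, max_lag + 1)
    ]


def box_pierce(autocorrs, sample_size):
    """Box-Pierce Q: T times the sum of squared autocorrelations."""
    return sample_size * sum(r ** 2 for r in autocorrs)


def ljung_box(autocorrs, sample_size):
    """Ljung-Box Q: small-sample adjustment of the Box-Pierce statistic."""
    T = sample_size
    return T * sum(
        (T + 2) / (T - i) * r ** 2 for i, r in enumerate(autocorrs, start=1)
    )


# Hypothetical residual autocorrelations at lags 1 and 2 with T = 100
print(box_pierce([0.1, 0.2], 100), ljung_box([0.1, 0.2], 100))
```

Each statistic would then be compared with a chi-squared critical value with h degrees of freedom, where h is the number of lags tested. Because every weight (T + 2) / (T - i) exceeds one, the Ljung-Box value is never smaller than the Box-Pierce value.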


Modeling Seasonality in an ARMA 


LO 21.n: Explain how seasonality is modeled in a covariance-stationary ARMA. 


Seasonality in time series data is evidenced by the recurrence of a pattern at the same
time every year (e.g., higher retail sales in the fourth quarter). For a pure AR process,
seasonality can be modeled by including a lag corresponding to the seasonality (i.e.,
fourth lag for quarterly data, twelfth lag for monthly data) in addition to any other
relevant short-term lags. A similar approach is used for MA processes.


An ARMA model with seasonality is denoted by ARMA(p, q) × ($p_s$, $q_s$), where $p_s$ and $q_s$
denote the seasonal component. In this context, $p_s$ and $q_s$ are restricted to values of 1
or 0 (i.e., true or false) such that a value of 1 corresponds to the seasonal lag (e.g., 12 for
monthly time series).


MODULE QUIZ 21.3
1. Which of the following statements about an autoregressive moving average (ARMA)
process is correct?
I. It involves autocorrelations that decay gradually.
II. It combines the lagged unobservable random shock of the MA process with the
observed lagged time series of the AR process.
A. I only.
B. II only.
C. Both I and II.
D. Neither I nor II.


2. Which of the following statements is correct regarding the usefulness of an
autoregressive (AR) process and an autoregressive moving average (ARMA) process
when modeling seasonal data?
I. They both include lagged terms and, therefore, can better capture a relationship
in motion.
II. They both specialize in capturing only the random movements in time series data.
A. I only.
B. II only.
C. Both I and II.
D. Neither I nor II.


3. To test the hypothesis that the autocorrelations of a time series are jointly equal to
zero based on a small sample, an analyst should most appropriately calculate:
A. a Ljung-Box (LB) Q-statistic.
B. a Box-Pierce (BP) Q-statistic.
C. either a Ljung-Box (LB) or a Box-Pierce (BP) Q-statistic.
D. neither a Ljung-Box (LB) nor a Box-Pierce (BP) Q-statistic.


KEY CONCEPTS 


LO 21.a 

To be covariance stationary, a time series must exhibit the following three properties: 
1. Its mean must be stable over time. 

2. Its variance must be finite and stable over time. 


3. Its covariance structure must be stable over time. 


LO 21.b 


The covariance between the current value of a time series and its value $\tau$ periods in the
past is referred to as its autocovariance at lag $\tau$. Its autocovariances for all $\tau$ make up
its autocovariance function. If a time series is covariance stationary, its autocovariance
function is stable over time.


LO 21.c 


White noise is a serially uncorrelated series with a mean of zero and a constant 
variance. If the observations in a white noise process are independent and uncorrelated, 
the process is referred to as independent white noise. If the process also follows a 
normal distribution, it is known as normal white noise or Gaussian white noise. 


LO 21.d, 21.g, 21.m
An autoregressive (AR) process is specified in the form of a variable regressed against
itself in lagged form. An AR(p) process is specified as:

$$y_t = d + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \ldots + \Phi_p y_{t-p} + \epsilon_t$$

where the absolute values of all $\Phi$ coefficients should be less than one. The long-run
(or unconditional) mean reverting level of an AR(p) series is computed as:

$$\mu = \frac{d}{1 - \Phi_1 - \Phi_2 - \ldots - \Phi_p}$$

LO 21.e 
A moving average (MA) process is a linear regression of the current values of a time
series against both the current and previous unobserved white noise error terms, which
are random shocks. MAs are always covariance stationary.


LO 21.f 
A lag operator when applied to a value of a time series yields its lagged value: 


$$y_{t-1} = L y_t$$


An autoregressive (AR) process is covariance stationary only if its lag polynomial is 
invertible. This invertibility is used in the Box-Jenkins methodology to select the 
appropriate time series model. 


LO 21.h 

Autoregressive moving average (ARMA) models are used for those time series that 
show signs of both autoregressive (AR) and moving average (MA) processes. An 
ARMA(p,q) indicates p lags in the AR process and q lags in the MA process. 


LO 21.i 

If an autocorrelation plot for a data series cuts off abruptly, the forecaster should 
consider using an MA process. If the autocorrelations instead decay gradually, the 
forecaster should consider specifying either an autoregressive (AR) process or an 
autoregressive moving average (ARMA) process. 


LO 21.j


Sample autocorrelations and partial autocorrelations are calculated using sample data 
and are used to validate and improve autoregressive moving average (ARMA) models. 


Initially, these sample statistics guide the analyst in selecting an appropriate model that 
conforms to the sample data. Residual autocorrelations at different lags are tested for 
statistical significance. If the model fits the sample data well, none of the residual 
autocorrelations should be statistically significantly different from zero. 


LO 21.k 
A joint test of determining that all residual autocorrelations equal zero versus at least 
one is not equal to zero is the Box-Pierce (BP) statistic: 


$$Q_{BP} = T \sum_{i=1}^{h} r_i^2$$

where:
$Q_{BP}$ = chi-squared statistic with h degrees of freedom
T = sample size
$r_i$ = sample autocorrelation at lag i


For smaller samples (T < 100), a version of the BP statistic known as the Ljung-Box (LB)
statistic works better:

$$Q_{LB} = T \sum_{i=1}^{h} \left(\frac{T+2}{T-i}\right) r_i^2$$


LO 21.l


Both autoregressive (AR) and autoregressive moving average (ARMA) processes can be
applied to time series data that show signs of seasonality. For example, we can forecast
seasonally impacted data with an ARMA(3,1) model using the following formula:

$$y_t = \mu + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \Phi_3 y_{t-3} + \theta \epsilon_{t-1} + \epsilon_t$$


LO 21.n 


Seasonality in time series data is evidenced by the recurrence of a pattern at the same
time every year (e.g., higher retail sales in the fourth quarter). For a pure autoregressive
(AR) process, seasonality can be modeled by including a lag corresponding to the
seasonality (i.e., fourth lag for quarterly data, twelfth lag for monthly data) in addition
to any other relevant short-term lags. A similar approach is used for moving average
(MA) processes.


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 21.1 


1.C In theory, a time series can be infinite in length and still be covariance stationary. 
To be covariance stationary, a time series must have a stable mean, a stable 
covariance structure (i.e., autocovariances depend only on displacement, not on
time), and a finite variance. (LO 21.a) 


2.B One feature that all ACFs have in common is that autocorrelations approach zero 
as the number of lags or displacements gets large. (LO 21.b) 


3.B If a white noise process is Gaussian (i.e., normally distributed), it follows that the
process is independent white noise. However, the reverse is not true; there can be 
independent white noise processes that are not normally distributed. Only those 
serially uncorrelated processes that have a zero mean and constant variance are 
white noise. (LO 21.c) 


Module Quiz 21.2 


1.D In order for an AR process to be covariance stationary, the sum of the
slope coefficients should be less than 1. (LO 21.d)


2.A A key difference between an MA representation and an AR process is that the MA 
process shows autocorrelation cutoff while an AR process shows a gradual decay 
in autocorrelations. (LO 21.e) 


3.B The coefficient is equal to 0.75, so using the concept derived from the Yule-Walker
equation, the first-period autocorrelation is 0.75 (i.e., $0.75^1$), and the
second-period autocorrelation is 0.5625 (i.e., $0.75^2$). (LO 21.)


4.B There are two main purposes of using a lag operator. First, an AR process is 
covariance stationary only if its lag polynomial is invertible. Second, this 
invertibility is used in the Box-Jenkins methodology to select the appropriate 
time series model. (LO 21.f) 


Module Quiz 21.3 


1.C The ARMA process is important because its autocorrelations decay gradually and 
because it captures a more robust picture of a variable being estimated by 
including both lagged random shocks and lagged observations of the variable 
being estimated. The ARMA model merges the lagged random shocks from the 
MA process and the lagged time series variables from the AR process. (LO 21.h) 


2.A Both AR models and ARMA models are good at forecasting with seasonal patterns 
because they both involve lagged observable variables, which are best for 
capturing a relationship in motion. It is the moving average representation that is 
best at capturing only random movements. (LO 21.l)


3.A The LB Q-statistic is appropriate for testing this hypothesis based on a small 
sample. (LO 21.k) 


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 11. 


READING 22 
NON-STATIONARY TIME SERIES 


Study Session 7 


EXAM FOCUS 


The previous reading introduced methods to forecast a covariance stationary time 
series. Next, we will address non-stationary time series. Sources of non-stationarity fall 
into three main categories: time trends, seasonality, and unit roots (random walks). For 
the exam, be prepared to distinguish among these sources and identify the 
recommended approach to resolving them. Series with time trends can often be 
transformed into stationary series by estimating and removing the trend component. 
Seasonality can be modeled with dummy variables or by analyzing year-on-year 
changes. Time series with unit roots should be analyzed in terms of their change from 
the previous period. 


MODULE 22.1: TIME TRENDS 


LO 22.a: Describe linear and nonlinear time trends. 


LO 22.g: Calculate the estimated trend value and form an interval forecast for a 
time series. 


Non-stationary time series may exhibit deterministic trends, stochastic trends, or both. 
Deterministic trends include both time trends and deterministic seasonality (which 
we will address in Module 22.2). Stochastic trends include unit root processes such as 
random walks (which we will address in Module 22.3). 


Time trends may be linear or nonlinear. A series that exhibits a linear time trend is 
one that tends to change by the same amount each period. Graphically, such a series 
resembles deviations around an increasing or decreasing straight line, as in Figure 22.1. 


Figure 22.1: Linear Time Trend

A linear time trend can be modeled simply as $y_t = \delta_0 + \delta_1 t + \epsilon_t$, where $\epsilon_t$ is a white noise
process. Note that what makes the series non-stationary is that the observations
depend on time.


While linear time trend models benefit from simplicity, they are of limited use in
finance and economics for two main reasons:

1. If the trend is downward, a linear model eventually produces negative values, which 
do not make sense when modeling quantities or prices. 


2. Even if the trend is upward, a constant increase in the amount implies a decreasing 
rate of growth over time. Many variables are more accurately modeled as growing at 
a constant rate, rather than by a constant amount. 


Fortunately, modeling techniques are not limited to linear time trends. An example of a
nonlinear time trend (or polynomial time trend) model is a second-degree or
quadratic polynomial model: $y_t = \delta_0 + \delta_1 t + \delta_2 t^2 + \epsilon_t$. Higher-order polynomials can also
be modeled.

Many processes in finance and economics can be modeled using a log-linear model. A
log-linear time trend represents a constant growth rate in a variable. This type of
model is stated as $\ln(y_t) = \delta_0 + \delta_1 t + \epsilon_t$. Graphically, they resemble the time series
shown in Figure 22.2.


Figure 22.2: Log-Linear Time Trend

As with linear models, log-linear models can be extended to include polynomials, such
as a log-quadratic model: $\ln(y_t) = \delta_0 + \delta_1 t + \delta_2 t^2 + \epsilon_t$.


For a linear or nonlinear model, the trend value can be estimated by regression, as long
as $\epsilon_t$ is white noise. If this assumption does not hold, a regression will produce
misleading indicators of significance (the t-statistics of the coefficients) and goodness
of fit (the $R^2$ of the regression), and a trend model alone is not sufficient to describe the
time series.


Once we have estimated a model, we can use it to make forecasts. For example, with a
simple linear trend model $y_t = \delta_0 + \delta_1 t + \epsilon_t$, the forecast for one period ahead (t + 1)
would be: $y_{t+1} = \delta_0 + \delta_1(t + 1) + \epsilon_{t+1}$, and a forecast for h periods ahead would be
$y_{t+h} = \delta_0 + \delta_1(t + h) + \epsilon_{t+h}$. With a logarithmic model, forecasting the level of a time series
requires us to assume $\epsilon_t$ is normally distributed.


We can also use the regression results to place a confidence interval around a forecast.
For example, assuming $\epsilon_t$ is normally distributed white noise, a 95% confidence interval
for the forecast h periods ahead is $y_{t+h} \pm 1.96 \times$ standard deviation of the regression
residuals.
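
A minimal Python sketch of a linear trend point forecast and its 95% interval follows. The function names and all numeric inputs are hypothetical, chosen only to illustrate the mechanics; the point forecast sets the future shock to its expected value of zero.

```python
def trend_point_forecast(delta0, delta1, t, h):
    """Point forecast h periods ahead from a linear trend model."""
    return delta0 + delta1 * (t + h)


def forecast_interval(point_forecast, residual_std, z=1.96):
    """Interval of +/- z residual standard deviations around the forecast.

    Assumes normally distributed white noise errors; z = 1.96 gives 95%.
    """
    half_width = z * residual_std
    return point_forecast - half_width, point_forecast + half_width


# Hypothetical estimates: intercept 10, slope 2, last observed period t = 20
point = trend_point_forecast(delta0=10.0, delta1=2.0, t=20, h=1)
low, high = forecast_interval(point, residual_std=5.0)
print(point, round(low, 2), round(high, 2))
```

The interval width here does not depend on h, which reflects the simplification used in the text: only the residual standard deviation is considered.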


Modeling and removing the trend component results in a detrended time series that we 
may be able to analyze further. Often the detrended time series is covariance stationary 
but not white noise. If so, we can improve on a trend model by using autoregressive 
(AR), moving average (MA), or autoregressive moving average (ARMA) techniques to 
forecast the detrended time series. (We described these techniques in Reading 21.) 


MODULE QUIZ 22.1
1. An analyst has determined that monthly vehicle sales in the United States have been
increasing over the last 10 years, but the growth rate over that period has been
relatively constant. Which model is most appropriate to predict future vehicle sales?
A. Linear model.
B. Quadratic model.
C. Log-linear model.
D. Log-quadratic model.

2. Using data from 2001 to 2020, an analyst estimates a model for an industry's annual
output as $\text{Output}_t = 80.163 + 4.248t + \epsilon_t$, from a regression with a residual standard
deviation of 107.574. Assume t equals a given full year (e.g., 2021) and that the error
term is normally distributed. A 95% confidence interval for a forecast of 2021
industry output is closest to:
A. 8,374 to 8,796.
B. 8,455 to 8,876.
C. 8,477 to 8,693.
D. 8,557 to 8,773.


MODULE 22.2: SEASONALITY 


LO 22.b: Explain how to use regression analysis to model seasonality. 


Seasonality in a time series is a pattern that tends to repeat from year to year. One
example is monthly sales data for a retailer. Because sales data normally varies
according to the calendar, we might expect this month’s sales ($x_t$) to be related to sales
for the same month last year ($x_{t-12}$).


Specific examples of seasonality relate to increases that occur at only certain times of 
the year. For example, purchases of retail goods typically increase dramatically every 
year in the weeks leading up to Christmas. Similarly, sales of gasoline generally increase 
during the summer months when people take more vacations. Weather is another 
common example of a seasonal factor as production of agricultural commodities is 
heavily influenced by changing seasons and temperatures. 


Seasonality in a time series can also refer to cycles shorter than a year. For example, a 
daily time series may exhibit deterministic effects on a specific day of the week. We use 
the more general term calendar effects to refer to any cycles that may recur within a 
year or less. 


An effective technique for modeling seasonality is to include seasonal dummy 
variables in a regression. Seasonal dummy variables can take a value of either one or 
zero to represent the season being on or off. For example, in a time series regression of 
monthly stock returns, we might incorporate a January dummy variable that would 
take on the value of one if a stock return occurred in January, and zero if it occurred in 
any other month. The January dummy variable helps us to see if stock returns in 
January were significantly different than stock returns in all other months of the year. 
Many January effect anomaly studies use this type of regression methodology. 


A regression model can include dummy variables for up to one less than the frequency 
of the data. For example, a model for quarterly time series can have up to three seasonal 
dummy variables, and a model for a monthly time series can have up to 11. The “extra” 
period is accounted for by the condition that all the other dummy variables equal zero. 
(If we included a dummy variable for the fourth quarter or the 12th month, we would 
bring multicollinearity into our regression because the value of one dummy variable 
could be predicted exactly from the values of the others.) 


Another approach to modeling seasonality is seasonal differencing. Instead of 
modeling the level of a series, we can model the differences between its level and its 
year-ago level. Seasonal differencing can also help in modeling series with time trends 
and unit roots. 
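
Seasonal differencing can be sketched in a few lines of Python. The function name and the sample quarterly series are hypothetical; the period argument is 4 for quarterly data or 12 for monthly data.

```python
def seasonal_difference(series, period):
    """Difference each observation from the one `period` steps earlier."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Hypothetical quarterly series: a repeating seasonal pattern plus growth of 2 per year
quarterly = [10, 20, 30, 40, 12, 22, 32, 42]
print(seasonal_difference(quarterly, period=4))  # [2, 2, 2, 2]
```

In this constructed example the year-over-year change is constant, so the seasonal pattern disappears after differencing.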


EXAMPLE: Seasonal dummy variables 


Consider the following regression equation for explaining quarterly earnings per 
share (EPS) in terms of the quarter of their occurrence: 


$$EPS_t = \beta_0 + \beta_1 D_{1,t} + \beta_2 D_{2,t} + \beta_3 D_{3,t} + \epsilon_t$$

where:
$EPS_t$ = a quarterly observation of earnings per share
$D_{1,t}$ = 1 if period t is the first quarter of a year, $D_{1,t}$ = 0 otherwise
$D_{2,t}$ = 1 if period t is the second quarter of a year, $D_{2,t}$ = 0 otherwise
$D_{3,t}$ = 1 if period t is the third quarter of a year, $D_{3,t}$ = 0 otherwise

The intercept term, $\beta_0$, represents the average value of EPS for the fourth quarter.
The slope coefficient on each dummy variable estimates the difference in EPS (on
average) between the respective quarter (i.e., quarter one, two, or three) and the
omitted quarter (the fourth quarter, in this case). Think of the omitted class as the
reference point.

Suppose we estimate the quarterly EPS regression model with 10 years of data (40
quarterly observations) and find that $\beta_0$ = 1.25, $\beta_1$ = 0.75, $\beta_2$ = -0.20, and $\beta_3$ = 0.10:

$$EPS_t = 1.25 + 0.75D_{1,t} - 0.20D_{2,t} + 0.10D_{3,t}$$

Determine the average EPS in each quarter over the past 10 years.

Answer:

The average EPS in each quarter over the past 10 years is as follows:
average fourth-quarter EPS = 1.25
average first-quarter EPS = 1.25 + 0.75 = 2.00
average second-quarter EPS = 1.25 - 0.20 = 1.05
average third-quarter EPS = 1.25 + 0.10 = 1.35

These are also the model’s predictions of future EPS in each quarter of the following
year. For example, to use the model to predict EPS in the first quarter of the next
year, set $D_{1,t}$ = 1, $D_{2,t}$ = 0, and $D_{3,t}$ = 0. Then $EPS_t$ = 1.25 + 0.75(1) - 0.20(0) +
0.10(0) = 2.00. This simple model uses average EPS for a specific quarter over the
past 10 years as the forecast of EPS in its respective quarter of the following year.
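
The quarterly predictions in this example can be reproduced with a short Python sketch. The function name is a hypothetical helper; the default coefficients are the estimates given in the example, with the fourth quarter as the omitted reference quarter.

```python
def predict_eps(quarter, beta0=1.25, dummy_coeffs=(0.75, -0.20, 0.10)):
    """Predicted EPS from the seasonal dummy regression.

    Dummies exist for quarters 1 to 3 only; quarter 4 is captured by the intercept.
    """
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    dummies = [1 if quarter == q else 0 for q in (1, 2, 3)]
    return beta0 + sum(b * d for b, d in zip(dummy_coeffs, dummies))


for quarter in (1, 2, 3, 4):
    print(quarter, round(predict_eps(quarter), 2))
```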


LO 22.f: Explain how to construct an h-step-ahead point forecast for a time 
series with seasonality. 


Forecasting a seasonal series is fairly straightforward. A pure seasonal dummy variable
model can be constructed as follows:

$$y_t = \sum_{i=1}^{s} \gamma_i (D_{i,t}) + \epsilon_t$$

After adding a time trend, the model can then take the following form:

$$y_t = \beta_1 (t) + \sum_{i=1}^{s} \gamma_i (D_{i,t}) + \epsilon_t$$

We can expand the forecasting model even further by allowing for other calendar
effects. For example, if we suspect a time series exhibits holiday variations (HDV) and
trading-day variations (TDV), we can account for them with additional dummy variables:

$$y_t = \beta_1 (t) + \sum_{i=1}^{s} \gamma_i (D_{i,t}) + \sum_{i=1}^{v_1} \delta_i^{HDV} (HDV_{i,t}) + \sum_{i=1}^{v_2} \delta_i^{TDV} (TDV_{i,t}) + \epsilon_t$$

This complete model can now be used for out-of-sample forecasts at time T + h by
constructing an h-step-ahead point forecast as follows:

$$y_{T+h} = \beta_1 (T + h) + \sum_{i=1}^{s} \gamma_i (D_{i,T+h}) + \sum_{i=1}^{v_1} \delta_i^{HDV} (HDV_{i,T+h}) + \sum_{i=1}^{v_2} \delta_i^{TDV} (TDV_{i,T+h}) + \epsilon_{T+h}$$



That is, determine the value for the time trend at time T + h and set the dummy 
variables to their appropriate 0 or 1 values for the period T + h. 
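
The h-step-ahead procedure can be sketched in Python for the simpler case of a time trend plus a full set of seasonal dummies. Everything below is an illustrative assumption: the function name, the coefficient values, and the convention that period 1 falls in the first season. Holiday and trading-day dummies are omitted for brevity and would enter the same way, and the future shock is set to its expected value of zero.

```python
def seasonal_trend_forecast(T, h, trend_coeff, seasonal_coeffs):
    """Point forecast for period T + h from a trend plus seasonal dummy model.

    seasonal_coeffs holds one coefficient per season (no separate intercept).
    """
    period = T + h
    season = (period - 1) % len(seasonal_coeffs)  # 0-based season of period T + h
    dummies = [1 if i == season else 0 for i in range(len(seasonal_coeffs))]
    seasonal_part = sum(g * d for g, d in zip(seasonal_coeffs, dummies))
    return trend_coeff * period + seasonal_part


# Hypothetical quarterly model estimated on 40 observations
gammas = [10.0, 12.0, 9.0, 15.0]
print(seasonal_trend_forecast(T=40, h=2, trend_coeff=0.5, seasonal_coeffs=gammas))
```

Here period 42 is a second-quarter observation, so only the second dummy equals one and the forecast is the trend value plus that quarter's coefficient.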


MODULE QUIZ 22.2


1. Jill Williams is an analyst in the retail industry. She is modeling a company's sales and 
has noticed a quarterly seasonal pattern. If Williams includes an intercept term in 
her model, how many dummy variables should she use to model the seasonality 
component? 

A. 1. 
B. 2. 
C. 3. 
D. 4. 


2. Consider the following regression equation utilizing dummy variables for explaining
quarterly EPS in terms of the quarter of their occurrence:

$$EPS_t = \beta_0 + \beta_1 D_{1,t} + \beta_2 D_{2,t} + \beta_3 D_{3,t} + \epsilon_t$$

where:
$EPS_t$ = a quarterly observation of EPS
$D_{1,t}$ = 1 if period t is the first quarter, $D_{1,t}$ = 0 otherwise
$D_{2,t}$ = 1 if period t is the second quarter, $D_{2,t}$ = 0 otherwise
$D_{3,t}$ = 1 if period t is the third quarter, $D_{3,t}$ = 0 otherwise

The intercept term $\beta_0$ represents the average value of EPS for the:
A. first quarter.
B. second quarter.
C. third quarter.
D. fourth quarter.


3. A model for the change in a retailer's quarterly sales, using seasonal dummy variables
DQ, is estimated as:

$$\Delta Sales_t = 4.9 - 2.1DQ_2 - 3.8DQ_3 + 6.5DQ_4$$

In the third quarter, sales are forecast to:
A. decrease by 3.8.
B. decrease by 1.0.
C. increase by 1.1.
D. increase by 3.8.


MODULE 22.3: UNIT ROOTS 


LO 22.c: Describe a random walk and a unit root. 


We describe a time series as a random walk if its value in any given period is its
previous value plus-or-minus a random “shock.” Symbolically, we state this as
$y_t = y_{t-1} + \epsilon_t$.


This seems simple enough, but if $y_t = y_{t-1} + \epsilon_t$, it follows logically that the same was
true in earlier periods, $y_{t-1} = y_{t-2} + \epsilon_{t-1}$, $y_{t-2} = y_{t-3} + \epsilon_{t-2}$, and so forth, all the way back
to the beginning of the time series: $y_1 = y_0 + \epsilon_1$.


If we substitute these (recursively) back into $y_t = y_{t-1} + \epsilon_t$, we eventually get:
$y_t = y_0 + \epsilon_1 + \epsilon_2 + \ldots + \epsilon_{t-2} + \epsilon_{t-1} + \epsilon_t$. That is, any observation in the series is a function of the
beginning value and all the past shocks, as well as the shock in the observation’s own
period.
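
The recursive substitution can be illustrated with a small Python sketch. The function name, starting value, and shocks are hypothetical; the point is that building the path one step at a time gives the same final value as adding all the shocks to the starting value.

```python
def random_walk_path(start_value, shocks):
    """Build a random walk: each value is the previous value plus a shock."""
    path = []
    level = start_value
    for shock in shocks:
        level = level + shock
        path.append(level)
    return path


shocks = [1.0, -2.0, 0.5]          # hypothetical shocks for periods 1 to 3
path = random_walk_path(5.0, shocks)
print(path)                        # [6.0, 4.0, 4.5]
print(path[-1] == 5.0 + sum(shocks))  # True
```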


A key property of a random walk is that its variance increases with time. This implies a 
random walk is not covariance stationary, so we cannot model one directly with AR, 
MA, or ARMA techniques. 


A random walk is a special case of a wider class of time series known as unit root
processes. They are called this because when expressed using lag polynomials (which
we introduced in Reading 21), one of their roots is equal to 1, as in:
$(1 - L)(1 - 0.65L)y_t = \epsilon_t$.


PROFESSOR'S NOTE

A unit root process is sometimes described as a random walk with drift. For
our purposes here, we can think of random walk and unit root more-or-less
interchangeably.


LO 22.d: Explain the challenges of modeling time series containing unit roots. 


If we attempt to model a time series directly when it has a unit root, we run into three 
main problems: 


1. Unlike stationary time series, a series with a unit root does not revert to a mean. 


2. Time series with unit roots often show spurious relationships with each other. 


3. If we use an ARMA model, its estimated parameters follow an asymmetric 
distribution that depends on the sample size and the presence of a time trend (a 
Dickey-Fuller distribution). This reduces our ability to select a correct model or 
make valid forecasts. 


All of these problems can be addressed by modeling the first differences of a unit root 
time series, which is to say their changes from one period to the next. In fact, modeling 
first differences also can address time trends and seasonality. 


If the first differences are not a stationary series, we can take the differences of those 
(i.e., double differencing). However, if we already have a stationary series, taking its
differences results in overdifferencing, which adds complexity to a forecasting model 
instead of reducing its complexity. 
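
The effect of differencing, and of overdifferencing, can be sketched in a few lines. This is an illustrative example only: the shocks are assumed to be standard normal white noise, and the helper names are hypothetical. Differencing the random walk once recovers the shocks; differencing again produces $\varepsilon_t - \varepsilon_{t-1}$, which has a larger variance and negative serial correlation that a model would then have to capture.

```python
import random
import statistics


def difference(series):
    """First differences: series[t] - series[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


def lag1_autocorrelation(series):
    """Sample first-order autocorrelation."""
    mean = statistics.mean(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    den = sum((v - mean) ** 2 for v in series)
    return num / den


rng = random.Random(0)  # fixed seed (an assumption) for reproducibility
shocks = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Build a random walk from the shocks, starting at y_0 = 0.
walk = [0.0]
for e in shocks:
    walk.append(walk[-1] + e)

first_diff = difference(walk)         # recovers the white noise shocks
second_diff = difference(first_diff)  # overdifferenced series

print("variance, first difference: ", statistics.pvariance(first_diff))
print("variance, second difference:", statistics.pvariance(second_diff))
print("lag-1 autocorrelation, second difference:",
      lag1_autocorrelation(second_diff))
```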


LO 22.e: Describe how to test if a time series contains a unit root. 


The most common way to test a series for a unit root is with an augmented Dickey- 
Fuller test. This is essentially a test of whether the lagged level of a series is a 
statistically significant factor in a regression model. That model can also include 
deterministic factors and lagged differences of the series, as appropriate. Such a model 
may be stated as: 


$\Delta Y_t = \gamma Y_{t-1} + \delta_0 + \delta_1 t + \lambda_1 \Delta Y_{t-1} + \dots + \lambda_p \Delta Y_{t-p} + \varepsilon_t$


where the $\delta$ terms represent deterministic factors and the $\lambda$ terms represent lagged
differences. The model should include just enough of these to make $\varepsilon_t$ a white noise
process.


The null hypothesis is that $\gamma$, the coefficient on the lagged value $Y_{t-1}$, is equal to
zero. If we fail to reject the hypothesis, the lagged level of the series has no predictive value
and the series is a random walk. If the series is covariance stationary, then $\gamma$ will be
significantly less than zero. (If $\gamma$ is significantly greater than zero, the series is not
covariance stationary because it’s an explosive process rather than a random walk.) So
although the null hypothesis is $\gamma = 0$, the alternative hypothesis is $\gamma < 0$, and not
$\gamma \neq 0$.
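
The mechanics of the test can be sketched with a simplified regression. The example below is a plain (non-augmented) Dickey-Fuller regression with an intercept only, written from scratch for illustration; it is not the full augmented test, and the function names, seed, and simulated series are hypothetical. The 5% critical value of about -2.86 is an assumed large-sample figure for the intercept-only case. In practice, a statistical package would supply both the lag selection and the exact critical values.

```python
import math
import random


def dickey_fuller(y):
    """OLS of dy_t on a constant and y_{t-1}; returns (gamma, t_stat)."""
    x = y[:-1]
    dy = [b - a for a, b in zip(y, y[1:])]
    n = len(dy)
    x_bar = sum(x) / n
    dy_bar = sum(dy) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (di - dy_bar) for xi, di in zip(x, dy))
    gamma = sxy / sxx
    delta0 = dy_bar - gamma * x_bar
    ssr = sum((di - delta0 - gamma * xi) ** 2 for xi, di in zip(x, dy))
    se_gamma = math.sqrt(ssr / (n - 2) / sxx)
    return gamma, gamma / se_gamma


def simulate_ar1(phi, n, rng):
    """y_t = phi * y_{t-1} + shock_t; phi = 1 gives a random walk."""
    y = [0.0]
    for _ in range(n):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y


CRITICAL_5PCT = -2.86  # assumed approximate critical value (intercept only)

rng = random.Random(7)  # fixed seed (an assumption) for reproducibility
gamma_w, t_w = dickey_fuller(simulate_ar1(1.0, 500, rng))
gamma_s, t_s = dickey_fuller(simulate_ar1(0.5, 500, rng))

for name, g, t in (("random walk", gamma_w, t_w), ("stationary", gamma_s, t_s)):
    verdict = "reject unit root" if t < CRITICAL_5PCT else "fail to reject"
    print(f"{name}: gamma = {g:.3f}, t = {t:.2f} -> {verdict}")
```

Note that the rejection rule is one-sided: only a sufficiently negative t-statistic leads to rejection of the unit root null.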


MODULE QUIZ 22.3
1. A random walk is most accurately described as a time series whose value is a function
of its: 
A. previous value only. 
B. beginning value only. 
C. previous value and a random shock. 
D. beginning value and all historical shocks. 


2. An augmented Dickey-Fuller test will reject the hypothesis that a process is a unit 
root if the coefficient on the lagged value is statistically significantly: 
A. less than zero. 
B. equal to zero. 
C. greater than zero. 
D. different from zero. 


KEY CONCEPTS 


LO 22.a 


A time series that tends to grow by a constant amount each period has a linear trend. A 
time series that tends to grow at a constant rate each period has a nonlinear trend. 


LO 22.b 

A regression model can account for seasonality by introducing dummy variables to 
represent seasonal effects. To avoid multicollinearity, the number of dummy variables 
must be one less than the number of periods in a year (e.g., three dummy variables for 
quarterly data). 


LO 22.c 

A time series is a random walk if its value in any given period is its previous value plus- 
or-minus a random “shock.” A random walk is not covariance stationary. Random walks 
are a special case of a wider class known as unit root processes, called this because 
when expressed using lag polynomials, one of their roots is equal to one. 


LO 22.d 

Unlike stationary time series, a series with a unit root does not revert to a mean. Time 
series with unit roots often show spurious relationships. Model parameters for a unit 
root series follow a Dickey-Fuller distribution, which reduces our ability to select a 
correct model or make valid forecasts. All of these problems can be addressed by 
modeling the differences of the unit root series. 


LO 22.e 


The most common way to test a series for a unit root is with an augmented Dickey- 
Fuller test. This is a test of whether the lagged level is a statistically significant factor in 
a regression model. The null hypothesis is that the coefficient on the lagged value is 
equal to zero. The alternative hypothesis, however, is that the coefficient is less than 
zero, not different from zero. 


LO 22.f 

Given a model with a time trend and seasonal dummy variables, we can construct an h- 
step-ahead point forecast by determining the value of the time trend at time T + h and 
setting the dummy variables to their appropriate 0 or 1 values for the period T + h. 


LO 22.g 


Assuming a model’s forecast error is a normally distributed white noise process, a 95%
confidence interval for the forecast h periods ahead is $\hat{y}_{T+h} \pm 1.96 \times$ standard
deviation of the regression residuals.


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 22.1 


1.C A log-linear model is most appropriate for a time series that grows at a relatively 
constant growth rate. (LO 22.a) 


2.B For t = 2021, a point forecast for industry output is 80.163 + 4.248(2021) =
8,665.371. A 95% confidence interval is 8,665.371 ± 1.96(107.574) = 8,454.526 to
8,876.216. (LO 22.g)


Module Quiz 22.2 


1.C Whenever we want to distinguish between s seasons in a model that incorporates 
an intercept, we must use s - 1 dummy variables. For example, if we have 


quarterly data, s = 4, and thus we would include s - 1 = 3 seasonal dummy 
variables. (LO 22.b) 


2.D The intercept term represents the average value of EPS for the fourth quarter. The 
slope coefficient on each dummy variable estimates the difference in EPS (on 


average) between the respective quarter (i.e., quarter one, two, or three) and the 
omitted quarter (the fourth quarter, in this case). (LO 22.b) 


3.C $\Delta Sales_{Q3}$ = 4.9 - 2.1(0) - 3.8(1) + 6.5(0) = 1.1 (LO 22)


Module Quiz 22.3 


1.D For a random walk, $y_t = y_0 + \varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_{t-1} + \varepsilon_t$,
so its value at time t is a function of its beginning value and all shocks, as well as the
shock in the observation’s own period. (LO 22.c)


2.A Although the null hypothesis is that the coefficient on the lagged value is equal to 
zero, the rejection condition is that the coefficient is less than zero. (LO 22.e) 


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 12. 


READING 23 


MEASURING RETURNS, VOLATILITY, 
AND CORRELATION 


Study Session 7 


EXAM FOCUS 


Traditionally, volatility has been synonymous with risk. Thus, the accurate estimation 
of volatility is crucial to understanding potential risk exposure. For the exam, 
understand how to calculate simple and continuously compounded returns and 
recognize differences between definitions of volatility. Since financial returns tend to 
follow nonnormal distributions, it is important to understand the properties of this 
distribution, how to test for this type of distribution, and what the tails look like in this 
distribution. This reading closes with the concepts of correlations and dependence and 
how to test for them using various methods. 


MODULE 23.1: DEFINING RETURNS AND VOLATILITY 


Simple and Continuously Compounded Returns 


LO 23.a: Calculate, distinguish, and convert between simple and continuously 
compounded returns. 


Returns on investments are often expressed as simple returns and continuously
compounded returns. A simple return can be expressed over various periods of time,
spanning from a single hour to a full year. Across multiple time periods, an asset’s
return can be calculated by compounding: taking the product of each period’s gross
simple return (1 + R) and subtracting one.


Assuming an asset (priced at P) is purchased at time t - 1 and sold at time t, the simple
return (R) is equal to:

$R_t = \frac{P_t - P_{t-1}}{P_{t-1}}$

Continuously compounded (log) returns (r) can be calculated using the following
formula:

$r_t = \ln P_t - \ln P_{t-1}$

For log returns, summing single period log returns produces a multiple period total
return with the following formula:

$r_T = \sum_{t=1}^{T} r_t$


For shorter time horizons, log returns are more appropriate than simple returns, and the
two are close when returns are small. However, log returns do not accurately approximate
simple returns when the simple return is large.


The following equation can be used to convert between the simple (R) and log return
(r). The simple return is never less than the log return, and the two are equal only when
both are zero:


$1 + R_t = \exp(r_t)$
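
The conversions can be checked numerically. The sketch below is illustrative only; the price series is hypothetical and the function names are not from the reading. It shows that compounding gross simple returns and summing log returns lead to the same multi-period result.

```python
import math


def simple_return(p_prev, p_now):
    """R_t = (P_t - P_{t-1}) / P_{t-1}."""
    return (p_now - p_prev) / p_prev


def log_return(p_prev, p_now):
    """r_t = ln(P_t) - ln(P_{t-1})."""
    return math.log(p_now) - math.log(p_prev)


def simple_to_log(big_r):
    """Invert 1 + R = exp(r)."""
    return math.log(1.0 + big_r)


def log_to_simple(small_r):
    """Apply 1 + R = exp(r)."""
    return math.exp(small_r) - 1.0


prices = [100.0, 105.0, 99.75, 110.0]  # hypothetical prices
pairs = list(zip(prices, prices[1:]))

# Multi-period simple return: product of gross returns, minus one.
total_simple = math.prod(1.0 + simple_return(a, b) for a, b in pairs) - 1.0
# Multi-period log return: sum of single-period log returns.
total_log = sum(log_return(a, b) for a, b in pairs)

print(f"total simple return: {total_simple:.4f}")
print(f"total log return:    {total_log:.4f}")
print(f"log return for R = 5%: {simple_to_log(0.05):.4f}")
```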


Volatility, Variance, and Implied Volatility 


LO 23.b: Define and distinguish among volatility, variance rate, and implied 
volatility. 


The volatility of a variable, $\sigma$, is expressed as the standard deviation of its returns. The
variance (or variance rate) of an asset is expressed as $\sigma^2$. Volatility, along with the
mean ($\mu$) and a shock ($e_t$) with a mean of zero and a variance of one, can be used to
derive the return (r) on an asset using the following formula:

$r_t = \mu + \sigma e_t$

For basic modeling, the return calculated across multiple periods is the sum of the
individual returns. The mean of a weekly return, assuming returns are calculated daily,
is $5\mu$; the weekly variance is $5\sigma^2$, and the weekly volatility is $\sqrt{5}\sigma$. The
annualized volatility (using monthly returns to measure volatility) is calculated as:


$\sigma_{annual} = \sqrt{12 \times \sigma^2_{monthly}}$


Assuming 252 trading days in the calendar year and volatility measured daily, the
annualized volatility is calculated as:


$\sigma_{annual} = \sqrt{252 \times \sigma^2_{daily}}$
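
A short numerical sketch of this scaling follows. The daily and monthly volatility inputs are hypothetical, and the helper name is an assumption for illustration.

```python
import math


def scale_volatility(vol_per_period, n_periods):
    """Volatility over n_periods: sqrt(n_periods * variance per period)."""
    return math.sqrt(n_periods * vol_per_period ** 2)


daily_vol = 0.01    # hypothetical 1% daily volatility
monthly_vol = 0.04  # hypothetical 4% monthly volatility

weekly_from_daily = scale_volatility(daily_vol, 5)
annual_from_daily = scale_volatility(daily_vol, 252)
annual_from_monthly = scale_volatility(monthly_vol, 12)

print(f"weekly volatility:            {weekly_from_daily:.4%}")
print(f"annual volatility (daily):    {annual_from_daily:.4%}")
print(f"annual volatility (monthly):  {annual_from_monthly:.4%}")
```

Variance scales linearly with the number of periods, so volatility scales with its square root.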


Options are used to calculate implied volatility, which is an annual volatility number 
that can be measured by backing into it using option prices. The Black-Scholes-Merton 
(BSM) model used to calculate the price of a call option includes inputs for the current 
asset price, strike price, time to maturity, risk-free interest rate, and annual variance. As 
long as the option price is known, the other variables except the annual variance are all 
observable and can be used to back into the variance number. The model’s inherent 
assumption that variance is constant over time is one drawback to using this approach 
to calculate implied volatility. 
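
The idea of backing into volatility from an option price can be sketched as follows. This is a minimal illustration, assuming a European call on a non-dividend-paying asset and using simple bisection as the search method; the inputs, bounds, and function names are hypothetical, and practitioners may use other root-finding schemes.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bsm_call(s, k, t, r, sigma):
    """BSM price of a European call (no dividends assumed)."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)


def implied_vol(price, s, k, t, r, lo=1e-4, hi=5.0, tol=1e-10):
    """Bisection search: the call price is increasing in sigma."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bsm_call(s, k, t, r, mid) > price:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Hypothetical inputs: price a call at 20% volatility, then recover it.
market_price = bsm_call(s=100.0, k=100.0, t=1.0, r=0.05, sigma=0.20)
recovered = implied_vol(market_price, s=100.0, k=100.0, t=1.0, r=0.05)
print(f"call price: {market_price:.4f}, implied volatility: {recovered:.4f}")
```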


The VIX Index is used to measure implied volatility for the S&P 500 for a prospective
period covering 30 calendar days. The methodology uses option prices with future
expiration dates and multiple strike prices, and, therefore, serves as a forward-looking
volatility measure. The VIX method, which requires a significant and liquid derivatives
market, can be computed for assets like equity indices, U.S. Treasury bonds,
commodities, and individual stocks.


MODULE QUIZ 23.1
1. Assuming a simple return of 5.00%, the log return will be closest to:
A. 4.88%. 
B. 5.00%. 
C. 5.05%. 
D. 5.13%. 


2. Which of the following statements is correct in regard to using the Black-Scholes- 
Merton (BSM) pricing model to calculate implied volatility? 


A. The option price is not needed for the calculation. 
B. Variance is assumed to remain constant over time. 
C. Time to maturity is not one of the components of the calculation. 
D. The current asset price has to remain constant in the calculation. 


MODULE 23.2: NORMAL AND NONNORMAL 
DISTRIBUTIONS 


LO 23.c: Describe how the first two moments may be insufficient to describe 
non-normal distributions. 


The first two moments of a probability density function are its mean and variance, 
respectively. The third moment is skewness and the fourth moment is kurtosis. For a 
normal distribution, which has thin tails and is symmetric, there is no skewness or 
excess kurtosis. However, the reality is that financial returns often follow a nonnormal 
distribution, so there is skewness and excess kurtosis. 


When examining returns for the S&P 500, the Japanese Yen (JPY)/U.S. Dollar (USD) 
exchange rate, and gold over a period of time, each of these assets exhibits skewness 
that is not equal to zero and each exhibits kurtosis larger than three (implying positive 
excess kurtosis). For the first two assets, the skewness is negative while gold returns 
reflect positive skewness. 


Jarque-Bera Test 


LO 23.d: Explain how the Jarque-Bera test is used to determine whether returns 
are normally distributed. 


The Jarque-Bera (JB) test statistic can be used to test whether a distribution is 
normal, meaning that there is zero skewness and no excess kurtosis (K - 3 = 0). If 
skewness is S and kurtosis is K, the null and alternative hypotheses are as follows: 


Null:
$H_0$: S = 0 and K = 3


Alternative:
$H_1$: S ≠ 0 or K ≠ 3

The test statistic, where T is the sample size, is:


$JB = (T - 1)\left(\frac{\hat{S}^2}{6} + \frac{(\hat{K} - 3)^2}{24}\right)$


Because the skewness and kurtosis components of the equation are asymptotically
normally distributed and uncorrelated, each squared component has a chi-squared
distribution with one degree of freedom ($\chi^2_1$), so the JB statistic is approximately
$\chi^2_2$. Smaller values indicate that the null hypothesis is likely true, while larger values
indicate a null that is likely to be rejected. At the 5% and 1% significance levels, the critical
values are 5.99 and 9.21, respectively, and the null is rejected if the JB statistic is above
the relevant level. It is often the case that the longer the time period for measurement, the
more likely it is that the JB statistic is smaller and the financial return approximates a
normal distribution.


EXAMPLE: Test the hypothesis that a fund’s returns follow a normal 
distribution 


Assume 60 monthly returns are sampled for a fund with skewness equal to 0.30 
(slight positive skewness) and kurtosis equal to 3.50. Using a 95% confidence 
interval, the critical value for the chi-squared distribution with two degrees of 
freedom equals 5.99 (i.e., corresponds to the 95th percentile of the chi-squared 
distribution). 


Construct a test of the hypothesis that the fund’s returns follow a normal 
distribution. 


Answer: 


The JB statistic for this fund equals:

$JB = (59)\left(\frac{0.3^2}{6} + \frac{(3.5 - 3)^2}{24}\right) = 59(0.015 + 0.0104) = 1.5$

The JB statistic for this fund (1.5) is less than the critical value (5.99). The statistic
does not lie in the rejection area to the right of 5.99. Therefore, we would fail to
reject the hypothesis that this fund’s returns follow a normal distribution.
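
The calculation in this example can be reproduced with a few lines of code. The function name below is a hypothetical helper; the inputs and critical value are the ones given in the example.

```python
def jarque_bera(n_obs, skewness, kurtosis):
    """JB = (T - 1) * (S^2 / 6 + (K - 3)^2 / 24)."""
    return (n_obs - 1) * (skewness ** 2 / 6.0 + (kurtosis - 3.0) ** 2 / 24.0)


CRITICAL_5PCT = 5.99  # chi-squared, two degrees of freedom, 5% level

jb = jarque_bera(n_obs=60, skewness=0.30, kurtosis=3.50)
reject_normality = jb > CRITICAL_5PCT

print(f"JB statistic: {jb:.2f}")
print("reject normality" if reject_normality else "fail to reject normality")
```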


The Power Law 


LO 23.e: Describe the power law and its use for non-normal distributions. 


As financial returns tend to follow nonnormal distributions, studying the tails can help
explain how returns are distributed in reality. In a normal distribution with a kurtosis
of three (and excess kurtosis of zero), the tails are thin. For other distributions, the tails
do not decline as quickly. Some of these distributions (including the Student’s
t-distribution) have power law tails, implying that the probability of seeing a return
larger than a specific value of x (with constants k and $\alpha$) is equal to:
$P(X > x) = kx^{-\alpha}$. Fat tails with slow declines are found in power law tails and
distributions like the Student’s t-distribution, which explains why observations away from
the mean are more common than those found in normal distributions.
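
The difference in tail decay can be seen by comparing the two tail probabilities directly. In the sketch below, the constants k = 1 and α = 3 are purely hypothetical choices for illustration, and the comparison is against a standard normal tail.

```python
import math


def normal_tail(x):
    """P(Z > x) for a standard normal variable."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def power_law_tail(x, k, alpha):
    """P(X > x) = k * x ** (-alpha)."""
    return k * x ** (-alpha)


K, ALPHA = 1.0, 3.0  # hypothetical constants

ratios = []
for x in (2.0, 4.0, 8.0):
    p_power = power_law_tail(x, K, ALPHA)
    p_normal = normal_tail(x)
    ratios.append(p_power / p_normal)
    print(f"x = {x}: power law {p_power:.2e}, normal {p_normal:.2e}")
```

The ratio of the power law tail to the normal tail grows rapidly with x, which is the sense in which the normal tail declines faster.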


MODULE QUIZ 23.2
1. Relative to a normal distribution, financial returns tend to have a nonnormal
distribution, which will have: 
A. thin tails. 
B. kurtosis greater than three. 
C. minimal to no skewness. 
D. a symmetrical distribution. 


2. Which of the following statements regarding the Jarque-Bera (JB) test statistic is 
most accurate? 


A. The null hypothesis states that skewness does not equal zero. 

B. The alternative hypothesis states that kurtosis is equal to three. 

C. The alternative hypothesis is likely to be rejected when the JB statistic is high. 
D. The null hypothesis is likely to not be rejected when the JB statistic is very small. 


3. Which of the following statements is most accurate regarding power law tails? 
A. More observations tend to be closer to the mean. 
B. The standard normal distribution exhibits power law tails. 
C. The tails exhibit faster declines than normally distributed tails. 
D. They tend to have “fatter” tails than those found in a normal distribution. 


MODULE 23.3: CORRELATIONS AND DEPENDENCE 


LO 23.f: Define correlation and covariance and differentiate between correlation 
and dependence. 


LO 23.g: Describe properties of correlations between normally distributed 
variables when using a one-factor model. 


LO 23.h: Compare and contrast the different measures of correlation used to 
assess dependence. 


Random variables can either be independent or dependent. If they are independent, the 
product of their marginal densities will equal their joint density per the equation: 


$f_{X,Y}(x, y) = f_X(x) f_Y(y)$


Diversification benefits increase and tail risk decreases when variables are 
independent. However, financial assets tend to be highly dependent from both a linear 
and nonlinear perspective. Pearson’s correlation serves as a method of measuring 
linear dependence. Correlation represents the linear relationship between two 
variables, while covariance represents the directional relationship between two 
variables. 


Regression is the link tying correlation to linear dependence. In the standard regression
equation $Y_i = \alpha + \beta X_i + \varepsilon_i$, if Y and X are standardized such that they each
have a variance of one (termed unit variance), the correlation will be equal to the regression
slope ($\beta$).


When dependence is nonlinear, there is no one statistic used to measure it. To measure
nonlinear dependence, measures such as Spearman’s rank correlation and Kendall’s
τ (tau) can be used. Both measures share four properties: their values must lie between -1
and 1, they are zero when the returns are completely independent, they are scale invariant,
and they are positive (negative) when the relationship between the variables is increasing
(decreasing).


Spearman's Rank Correlation 


Spearman’s correlation is a linear correlation estimator which is applied to ranks of
observations. The strength of the linear relationship between ranks, as opposed to the
linear relationship between the variables, drives rank correlation. In a situation where
two random variables (X and Y) have n associated observations, $Rank_X$ and $Rank_Y$ serve
as the ranks of the variables. One equates to the smallest value of each variable, two the
second smallest, three the third smallest, and the trend continues on until n serves as
the largest rank. The equation for the correlation estimator is:


$\hat{\rho}_s = \frac{Cov[Rank_X, Rank_Y]}{\sqrt{V[Rank_X]}\sqrt{V[Rank_Y]}}$

Assuming ranks are distinct, and $Rank_{X_i} - Rank_{Y_i}$ represents the difference ($d_i$) in ranks
for the same observation, the following equation can be used to express the estimator:

$\hat{\rho}_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

The correlation will be close to 1 when highly ranked values of X and Y are paired 
together. However, when the largest values of one variable are grouped with the 
smallest values of another, the variables will have strong negative dependence, the 
difference will be large, and correlation will be close to -1. If the variables themselves 
have a strong linear relationship, rank and linear correlation will be similar. If there are 
large differences in linear and rank correlations, there is likely to be a key nonlinear 
relationship. Rank correlation, unlike linear correlation, is not as sensitive or 
vulnerable to outliers because ranks rather than variable values are used. 


EXAMPLE: Spearman’s rank correlation 


Calculate the Spearman rank correlation for the returns of stocks X and Y provided 
in the following table. 


Returns for Stocks X and Y


Year      X         Y
2011     25.0%    -20.0%
2012    -20.0%     10.0%
2013     40.0%     20.0%
2014    -10.0%     30.0%

Answer: 


The calculations for determining the Spearman rank correlation coefficient are 
shown in the table below. The first step involves ranking the returns for stock X from 
lowest to highest in the second column. The first column denotes the respective year 
for each return. The returns for stock Y are then listed for each respective year. The 
fourth and fifth columns rank the returns for variables X and Y. The differences 
between the rankings for each year are listed in column six. Lastly, the sum of 
squared differences in rankings is determined in column 7. 


Ranking Returns for Stocks X and Y


Year      X         Y        X Rank   Y Rank   d_i   d_i^2
2012    -20.0%     10.0%       1        2      -1      1
2014    -10.0%     30.0%       2        4      -2      4
2011     25.0%    -20.0%       3        1       2      4
2013     40.0%     20.0%       4        3       1      1
                                                  Sum = 10


The Spearman rank correlation coefficient can then be determined as 0.0:


$\hat{\rho}_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 10}{4(16 - 1)} = 0.0$
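
The example can be reproduced in code. The sketch below assumes distinct values (no ties), as the rank-difference formula requires; the function names are hypothetical helpers. The last lines also illustrate the point about outliers: replacing the largest X return with an extreme value leaves the ranks, and therefore the rank correlation, unchanged.

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid for distinct ranks."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))


# Returns for 2011 through 2014, in percent, from the example.
x = [25.0, -20.0, 40.0, -10.0]
y = [-20.0, 10.0, 20.0, 30.0]

rho = spearman(x, y)
print("X ranks:", ranks(x))
print("Y ranks:", ranks(y))
print("Spearman rank correlation:", rho)

# An extreme outlier in X does not change the ranks or the result.
x_outlier = [25.0, -20.0, 4000.0, -10.0]
rho_outlier = spearman(x_outlier, y)
```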


Kendall’s τ


Kendall’s τ is used to measure concordant and discordant pairs and their relative
frequency. The measure represents the difference between the probabilities of
concordance and discordance.


To see how this is applied, take two observations of a pair of random variables,
$(X_i, Y_i)$ and $(X_j, Y_j)$. If $X_i < X_j$ and $Y_i < Y_j$ (or both inequalities are reversed), the
relative positions of X and Y are in agreement and the pair is concordant. If the orders are
different, the pair will be discordant. If $X_i = X_j$ or $Y_i = Y_j$, the pair is tied and is
neither concordant nor discordant. Random variables with many concordant pairs tend
to have strong positive dependence, whereas variables with many discordant pairs tend
to have a strong negative relationship.


The equation for calculating Kendall’s τ is:


$\hat{\tau} = \frac{n_c - n_d}{n(n - 1)/2} = \frac{n_c}{n_c + n_d + n_t} - \frac{n_d}{n_c + n_d + n_t}$

where:

$n_c$ = number of concordant pairs

$n_d$ = number of discordant pairs

$n_t$ = number of ties


If all pairs are concordant, the output will equal exactly 1. If all pairs are discordant, 
the output will equal -1. Any other pattern will produce a number between -1 and 1. 


EXAMPLE: Kendall’s τ


Calculate the Kendall τ correlation coefficient for the stock returns of X and Y listed
below.


Ranked Returns for Stocks X and Y


Year      X         Y        X Rank   Y Rank
2012    -20.0%     10.0%       1        2
2014    -10.0%     30.0%       2        4
2011     25.0%    -20.0%       3        1
2013     40.0%     20.0%       4        3


Answer: 


Begin by comparing the rankings of X and Y stock returns in columns four and five of 
the table above. There are four pairs of observations, so there will be six 
combinations. The following table summarizes the pairs of rankings based on the 
stock returns for X and Y. There are three concordant pairs and three discordant 
pairs. 


Categorizing Pairs of Stock X and Y Returns


Concordant Pairs      Discordant Pairs
{(1,2),(2,4)}         {(1,2),(3,1)}
{(1,2),(4,3)}         {(2,4),(3,1)}
{(3,1),(4,3)}         {(2,4),(4,3)}


Kendall’s τ can then be determined as 0:

$\hat{\tau} = \frac{n_c - n_d}{n(n - 1)/2} = \frac{3 - 3}{4(4 - 1)/2} = 0$


Thus, there is no positive or negative relationship between the stock returns of X and
Y based on the Kendall τ correlation coefficient.
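
The pair counting in this example can be automated. The sketch below is a minimal illustration with a hypothetical function name; it classifies each pair by the sign of the product of the X difference and the Y difference, and treats a zero product as a tie.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Return (n_concordant, n_discordant, tau) over all n(n-1)/2 pairs."""
    n_c = n_d = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            n_c += 1   # X and Y move in the same direction
        elif sign < 0:
            n_d += 1   # X and Y move in opposite directions
        # sign == 0 is a tie: neither concordant nor discordant
    n = len(x)
    return n_c, n_d, (n_c - n_d) / (n * (n - 1) / 2)


# X and Y ranks from the example table.
x_rank = [1, 2, 3, 4]
y_rank = [2, 4, 1, 3]

n_c, n_d, tau = kendall_tau(x_rank, y_rank)
print(f"concordant: {n_c}, discordant: {n_d}, tau: {tau}")
```

Using the raw returns instead of the ranks gives the same answer, because only the ordering of the observations matters.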


Positive Definiteness 

When all random variables have unit variance (variance equal to 1), the correlation
matrix and covariance matrix are identical. Every linear combination of random
variables must have a variance which is non-negative. Positive definiteness, defined as
every weighted average combination having a positive variance, requires that the
variance of an average of components in a covariance matrix must be positive.


In order to ensure that correlation matrices are positive definite, two structured
correlations are typically used. The first type, known as equicorrelation, sets all
correlations equal to the same amount. The second type applies a structure which
assumes that correlations are due to a common factor exposure, thereby making the
correlation for any entries into the matrix equal to $\rho_{ij} = \gamma_i \gamma_j$ and each entry
having a correlation between -1 and 1.
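
The variance requirement can be illustrated with small matrices. The sketch below is only an illustration: the correlation values, factor loadings, and function names are hypothetical, and checking a single weight vector can reveal a violation but cannot by itself prove that a matrix is positive definite. For three unit-variance assets with common correlation ρ, the equal-weighted average has variance (1 + 2ρ)/3, which turns negative when ρ is below -0.5.

```python
def equicorrelation_matrix(n, rho):
    """Ones on the diagonal, rho everywhere else."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def one_factor_matrix(gammas):
    """Off-diagonal entry (i, j) equals gamma_i * gamma_j."""
    n = len(gammas)
    return [[1.0 if i == j else gammas[i] * gammas[j] for j in range(n)]
            for i in range(n)]


def combination_variance(weights, matrix):
    """Variance of a weighted combination: w' C w."""
    return sum(wi * wj * matrix[i][j]
               for i, wi in enumerate(weights)
               for j, wj in enumerate(weights))


equal_weights = [1.0 / 3.0] * 3

var_valid = combination_variance(equal_weights, equicorrelation_matrix(3, 0.5))
var_invalid = combination_variance(equal_weights, equicorrelation_matrix(3, -0.6))
var_factor = combination_variance(equal_weights,
                                  one_factor_matrix([0.9, 0.5, -0.3]))

print(f"equicorrelation  0.5: variance of average = {var_valid:.4f}")
print(f"equicorrelation -0.6: variance of average = {var_invalid:.4f}")
print(f"one-factor structure: variance of average = {var_factor:.4f}")
```

The negative variance in the second case shows that a common correlation of -0.6 among three variables is not a valid correlation structure.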


MODULE QUIZ 23.3
1. An analyst calculates a Spearman's rank correlation of 0.48. This output is indicative
of: 
A. positive linear correlation. 
B. negative linear correlation. 
C. positive nonlinear dependence. 
D. negative nonlinear dependence. 
2. Which of the following situations is indicative of equicorrelation in a correlation 
matrix? 
A. Correlations which are all equal to 1. 
B. Variables with correlations other than 0.
C. Variables with negative coefficients of determination. 
D. Three variables with a correlation with one another of 1.25. 


KEY CONCEPTS 


LO 23.a 

Investment returns can be stated as both simple and continuously compounded (log)
returns. With simple returns, an asset’s return across multiple time periods is
calculated by compounding: taking the product of each period’s gross simple return
(1 + R) and subtracting one. For continuously compounded returns, an asset’s return
across multiple time periods is calculated by taking the sum of each single period’s log
returns. Log returns, which are never greater than the corresponding simple returns, are
more often used for shorter time horizons. The equation $1 + R_t = \exp(r_t)$ can be used to
convert between simple (R) and log (r) returns.


LO 23.b 

The volatility of a variable, $\sigma$, is expressed as the standard deviation of its returns. The
variance (or variance rate) of an asset is expressed as $\sigma^2$. Options are used to calculate
implied volatility, which is an annual volatility number that can be measured by 
backing into it using option prices. All of the variables included in the Black-Scholes- 
Merton (BSM) model used to calculate call option prices are observable except the 
annual variance, meaning that the variance value can be derived as long as the price of 
the option is known. The VIX Index is a forward-looking methodology used to measure 
implied volatility for the S&P 500 for a prospective period covering 30 calendar days. 


LO 23.c 

The first two moments of a probability density function are its mean and variance, 
which are used to describe a normal distribution. The third moment is the skewness 
and the fourth moment is the kurtosis. For a normal distribution, which has thin tails 
and is symmetric, there is no skewness or excess kurtosis. Financial returns often 
follow a nonnormal distribution and as such, there is skewness and excess kurtosis. 


LO 23.d 

The Jarque-Bera (JB) test statistic can be used to test whether a distribution is normal, 
meaning that there is zero skewness and no excess kurtosis (K - 3 = 0). If the result falls 
below the critical value, the null will not be rejected and the distribution will be 
deemed normal. If the result is above the critical value, the null will be rejected. 
Financial returns are likely to follow a more normal distribution over longer time 
periods. 


LO 23.e 

Normal distributions have a kurtosis of three (and excess kurtosis of zero) and thin 
tails. For other types of distributions, the tails do not decline as quickly. Fat tails with 
slow declines are found in power law tails and distributions like the Student’s t- 
distribution, which explains why observations away from the mean are more common 
than those found in normal distributions. 


LO 23.f and 23.h 

Correlation represents the linear relationship between two variables, while covariance 
represents the directional relationship between two variables. Random variables can 
either be independent or dependent. Pearson’s correlation serves as a method of 
measuring linear dependence. 


When dependence is nonlinear, measures such as Spearman’s rank correlation and
Kendall’s τ (tau) can be used. As with traditional correlation measures, the values for
both must lie between -1 and 1.


Spearman’s correlation is a linear correlation estimator which is applied to ranks of 
observations. The strength of the linear relationship between ranks, as opposed to the 
linear relationship between the variables, drives rank correlation. The correlation will 
be close to 1 when highly ranked values of X and Y are paired together. However, when 
the largest values of one variable are grouped with the smallest values of another, the 
variables will have strong negative dependence, the difference will be large, and 
correlation will be close to -1. 


Kendall’s τ is used to measure concordant and discordant pairs and their relative
frequency. The measure represents the difference between the probabilities of 
concordance and discordance. If all pairs are concordant, the output will equal exactly 
1. If all pairs are discordant, the output will equal -1. Any other pattern will produce a 
number between -1 and 1. 


LO 23.g 

Every linear combination of random variables must have a variance which is non- 
negative. Positive definiteness, defined as every weighted average combination having a 
positive variance, requires that the variance of an average of components in a
covariance matrix must be positive. In order to ensure that correlation matrices are 
positive definite, two structured correlations are typically used. The first type, known 
as equicorrelation, sets all correlations equal to the same amount. The second type 
applies a structure which assumes that correlations are due to a common factor 
exposure. 


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 23.1 


1.A The equation to convert the simple return to the log return is:


$1 + R_t = \exp(r_t)$

Plugging in values, $1.05 = \exp(r_t)$. Taking the natural log of each side to isolate the log
return (r) results in $r_t = \ln 1.05 = 0.0488$, or 4.88%.

(LO 23.a)


2.B One of the drawbacks to using the BSM pricing model to derive implied volatility 
is that variance must remain constant over time. The option price and time to 
maturity are both needed for the calculation, but there is no requirement that the 
current underlying asset price has to remain constant. (LO 23.b) 


Module Quiz 23.2 


1.B A nonnormal distribution is likely to have either positive or negative skewness 
and a kurtosis that is different from three. A normal distribution has thin tails, 
kurtosis equal to three, no skewness, and a symmetrical distribution. (LO 23.c) 


2.D When the JB test statistic is very small, the null hypothesis is likely to not be 
rejected. When the statistic is high, the null is likely to be rejected (with the 
alternative hypothesis not being rejected). The null hypothesis states that 
skewness is zero and kurtosis is three (with excess kurtosis therefore equal to 
zero). The alternative hypothesis states that skewness is not equal to zero or
kurtosis is not equal to three. (LO 23.d)


3.D Power law tails tend to be “fatter” than the tails found in normal distributions. 
Power law tails reflect more observations found farther away from the mean and 
they tend to exhibit slower declines than the tails in normal distributions. (LO 
23.e) 


Module Quiz 23.3 


1.C Correlation will be between -1 and 1. Any number above 0 is going to represent a 
positive output. Because rank correlation is used to measure nonlinear 
dependence, an output of 0.48 indicates positive nonlinear dependence. (LO 23.f) 


2.A If all of the variables in a correlation matrix have correlations of 1, this is
indicative of equicorrelation. They can have correlations of zero, as long as all are 
equal. Variables cannot have negative coefficients of determination (which are 
correlations squared) and correlations can never be greater than 1. (LO 23.g) 


The following is a review of the Quantitative Analysis principles designed to address the learning objectives set 
forth by GARP®. Cross-reference to GARP FRM Part I Quantitative Analysis, Chapter 13. 


READING 24 
SIMULATION AND BOOTSTRAPPING 


Study Session 7 


EXAM FOCUS 


Simulation methods model uncertainty by generating random inputs that are assumed 
to follow an appropriate probability distribution. This reading discusses the basic steps 
for conducting a Monte Carlo simulation and compares this simulation method to the 
bootstrapping technique. For the exam, be able to explain ways to reduce Monte Carlo 
sampling error, including the use of antithetic and control variates. Also, understand the 
pseudo-random number generation method. Finally, be able to describe the advantages 
and disadvantages of the bootstrapping technique in comparison to the traditional 
Monte Carlo approach. 


MODULE 24.1: MONTE CARLO SIMULATION AND 
SAMPLING ERROR REDUCTION 


LO 24.a: Describe the basic steps to conduct a Monte Carlo simulation. 


Monte Carlo simulations are often used to model complex problems or to estimate
variables when the sample size is small. A few practical finance applications of Monte
Carlo simulations include pricing exotic options, estimating the impact to financial
markets of changes in macroeconomic variables, and examining capital requirements
under stress-test scenarios.

There are five basic steps used to conduct a simulation: 

1. Generate random draw data $x_i = [x_{1i}, x_{2i}, \dots, x_{ni}]$. For a Monte Carlo process,
this data is drawn from an assumed data generating process (DGP).

2. Calculate the statistic or function of interest, $g_i = g(x_i)$.

3. Repeat Steps 1 and 2 to produce N replications.

4. Estimate the quantity of interest from $\{g_1, g_2, \dots, g_N\}$.

5. Evaluate the accuracy by computing the standard error. N should be increased until
the required level of accuracy is achieved.


The first step of conducting a simulation requires generating random inputs that are 
assumed to follow a specific probability distribution. 


The second step of the simulation generates scenarios or trials based on randomly 
generated inputs drawn from a pre-specified probability distribution. The most 
common probability distribution used is the standard normal distribution. However, 
Student’s t-distribution is often used if the user believes it is a better fit for the data. A
well-defined simulation model requires the generation of variables that follow 
appropriate probability distributions. 


The third and fourth steps in the simulation process allow for data analysis related to 
the properties of the probability distributions of the output variables. In other words, 
rather than making just one output estimate for a problem, the model generates a 
probability distribution of estimates. This provides the user with a better 
understanding of the range of possible outcomes. 


In step five, N is the number of replications (iterations). Simulations typically use
1,000 to 10,000 replications, depending on how costly each replication is to generate.


For example, suppose we are managing an investment portfolio and desire to estimate
the ending capital in the portfolio in one year, \( C_1 \). The initial capital investment, \( C_0 \), is
$100 invested in the S&P 500. The return is a random variable that depends on how the
market performs over the next year.


If we assume the return over the next year is equal to a historical mean return, we can
calculate one point estimate of the ending capital based on the equation: \( C_1 = C_0(1 + r) \).
The return over the next period is a random variable, and a simulation model estimates 
multiple scenarios to represent future returns based on a probability distribution of 
possible outcomes. The output variable is an estimate of an ending amount of capital 
that is also a random variable. The simulation model allows us to visualize the output 
and analyze the probability distribution of the ending capital amounts generated by the 
model. 
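
The five steps can be sketched in Python for the ending-capital example. This is only a minimal illustration under stated assumptions: the normal distribution for the return, the 10% mean return, the 14.8% standard deviation, and the function name are all hypothetical choices, not values prescribed by the reading.

```python
import math
import random
import statistics


def simulate_ending_capital(n_reps, c0=100.0, mean_return=0.10,
                            sd_return=0.148, seed=42):
    """Monte Carlo estimate of one-year ending capital, C1 = C0 * (1 + r).

    Assumption for illustration only: r ~ Normal(mean_return, sd_return).
    """
    rng = random.Random(seed)  # seeded so the run can be reproduced
    outcomes = []
    for _ in range(n_reps):                      # Step 3: repeat N times
        r = rng.gauss(mean_return, sd_return)    # Step 1: draw from the assumed DGP
        outcomes.append(c0 * (1.0 + r))          # Step 2: function of interest
    estimate = statistics.mean(outcomes)         # Step 4: quantity of interest
    std_error = statistics.stdev(outcomes) / math.sqrt(n_reps)  # Step 5: accuracy
    return estimate, std_error, outcomes


estimate, std_error, outcomes = simulate_ending_capital(n_reps=10_000)
print(f"Estimated ending capital: {estimate:.2f} (standard error {std_error:.3f})")
```

The list of outcomes is returned as well, so the full distribution of ending capital (not just its mean) can be inspected, which is the main point of steps three and four.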


Reducing Monte Carlo Sampling Error 


LO 24.b: Describe ways to reduce Monte Carlo sampling error. 


The sampling error for a Monte Carlo simulation is quantified as the standard error
estimate. The standard error of the estimated expected value is computed as \( s/\sqrt{N} \), where s is
the standard deviation of the output variables and N is the number of scenarios or
replications in the simulation. Based on this equation, it intuitively follows that to 
reduce the standard error estimate by a factor of 10, the analyst must increase N by a 
factor of 100. (Because the square root of 100 is 10, if we increase the sample size 100 
times, it will reduce the standard error estimate by dividing by 10.) 


Suppose we continue the illustration from the previous example and run a simulation
to estimate the ending capital amount for an initial investment portfolio of $100. The
number of replications is initially 100 (i.e., N = 100), resulting in a mean ending capital
of $110 and a standard deviation of $14.80. For this example, the standard error
estimate is computed as $1.48 (i.e., $14.80 / 10). Now, suppose we want to increase the
accuracy by reducing the standard error estimate. How can we increase the accuracy of
the simulation?


The accuracy of simulations depends on the standard deviation and the number of 
scenarios run. We cannot control the standard deviation, but we can control the 
number of replications. Assume we rerun the previous simulation with 400 replications 
that results in the same mean ending capital of $110, and the standard deviation 
remains at $14.80. The standard error estimate for the simulation with 400 replications 
is then $0.74 (i.e., 14.80 / 20). With four times the number of scenarios (4 x N, or 400, in 
this example) the standard error estimate is cut in half to $0.74. In other words, 
quadrupling the number of scenarios will improve the accuracy twofold. 
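
The arithmetic in this example can be checked with a few lines of Python. The helper name below is hypothetical. The standard deviation of $14.80 and the replication counts of 100 and 400 come from the example, while 10,000 is added only to show the effect of a hundredfold increase.

```python
import math


def standard_error(std_dev, n_reps):
    """Standard error of a Monte Carlo estimate: s divided by the square root of N."""
    return std_dev / math.sqrt(n_reps)


# Standard deviation of $14.80 held fixed while N grows.
for n_reps in (100, 400, 10_000):
    print(n_reps, round(standard_error(14.80, n_reps), 3))
```

Going from 100 to 400 replications halves the standard error, and going from 100 to 10,000 divides it by 10.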


However, increasing the number of generated scenarios can become costly for more 
complex multi-period simulations. Variance reduction techniques offer an alternative 
way to reduce the sampling error of a Monte Carlo simulation. The two most 
commonly used techniques for reducing the standard error estimate are antithetic 
variates and control variates. 


Antithetic Variates 


LO 24.c: Explain the use of antithetic and control variates in reducing Monte 
Carlo sampling error. 


One reason sampling error occurs is that there is often a wide range of possible
outcomes for a particular experiment or problem. Thus, to replicate the entire range of
possible outcomes, the sampling sets must be recreated numerous times. However,
increasing the number of samples drawn may be costly and time-consuming. As an
alternative approach, the antithetic variate technique can reduce Monte Carlo
sampling error by rerunning the simulation using a complement set of the original set
of random variables.


If the original set of random draws is denoted \( u_t \) for each replication, then the
simulation is rerun with the complement set of random numbers denoted \( -u_t \). The use
of antithetic variates should result in a lower covariance and variance, because the two
sets are perfectly negatively correlated [i.e., \( \mathrm{corr}(u_t, -u_t) = -1 \)]. The following example
illustrates how the standard error for a Monte Carlo simulation is reduced by using the
antithetic variate technique.


First, consider a simulation of two sets that does not use the antithetic variate
technique. Suppose the average parameter estimate is determined by two Monte Carlo
simulations using different random sample sets. The average output parameter value \( \bar{x} \)
for the two simulations using different random sample replications is simply calculated
as:


\[ \bar{x} = \frac{x_1 + x_2}{2} \]


where \( x_1 \) and \( x_2 \) are the average output parameter values for simulation sets one and
two, respectively.


Next, we can calculate the variance of the average of the two sets as follows:


\[ \mathrm{var}(\bar{x}) = \frac{\mathrm{var}(x_1) + \mathrm{var}(x_2) + 2\,\mathrm{cov}(x_1, x_2)}{4} \]


Without using antithetic variates, the two sets of Monte Carlo replications are
independent. Thus, the covariance will be zero and the variance of \( \bar{x} \) is simply reduced
to the following:


\[ \mathrm{var}(\bar{x}) = \frac{\mathrm{var}(x_1) + \mathrm{var}(x_2)}{4} \]

The use of antithetic variates results in negative covariance between the original
random draws and their complements (i.e., antithetic variates). When the simulated
outputs inherit this negative relationship, the covariance term in the variance of the
average is negative, so the Monte Carlo sampling error is smaller than it would be with
two independent sets.
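
A small Python sketch can make the comparison concrete. It is an illustration under assumptions that are not in the reading: the output function (a lognormal ending capital with an 8% drift and 15% volatility), the standard normal draws, and all function names are hypothetical. Each estimate averages two outputs, built either from two independent draws or from a draw and its complement.

```python
import math
import random
import statistics


def ending_capital(z, c0=100.0, mu=0.08, sigma=0.15):
    """Hypothetical output function: capital after one year for a standard normal draw z."""
    return c0 * math.exp(mu + sigma * z)


def standard_error(values):
    """Sample standard deviation divided by the square root of the number of values."""
    return statistics.stdev(values) / math.sqrt(len(values))


def independent_pair_averages(n_pairs, rng):
    """Average two outputs built from two independent draws."""
    return [0.5 * (ending_capital(rng.gauss(0.0, 1.0)) +
                   ending_capital(rng.gauss(0.0, 1.0)))
            for _ in range(n_pairs)]


def antithetic_pair_averages(n_pairs, rng):
    """Average two outputs built from a draw u and its complement -u."""
    averages = []
    for _ in range(n_pairs):
        u = rng.gauss(0.0, 1.0)
        averages.append(0.5 * (ending_capital(u) + ending_capital(-u)))
    return averages


rng = random.Random(7)
plain = independent_pair_averages(5_000, rng)
anti = antithetic_pair_averages(5_000, rng)
print(f"independent pairs: mean {statistics.mean(plain):.2f}, SE {standard_error(plain):.4f}")
print(f"antithetic pairs:  mean {statistics.mean(anti):.2f}, SE {standard_error(anti):.4f}")
```

Both approaches use the same number of function evaluations. The output function here is monotonic in the draw, which is why the paired outputs are negatively correlated. For a function that is not monotonic, the reduction may be smaller or absent.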


Control Variates 


The control variate technique is a widely used method to reduce the sampling error 
in Monte Carlo simulations. A control variate involves replacing a variable x (under 
simulation) that has unknown properties, with a similar variable y that has known 
properties. 


Suppose two separate simulations are conducted on variable x with unknown
properties and control variable y with known properties using the same set of random
numbers. Also assume that the Monte Carlo simulation estimated variables for x and y
are denoted as \( \hat{x} \) and \( \hat{y} \), respectively. The original estimate for x can be redefined as \( x^* \)
as follows:


\[ x^* = y + (\hat{x} - \hat{y}) \]


The new \( x^* \) variable estimate will have a smaller sampling error than the original x
variable if the control statistic and statistic of interest are highly correlated. The Monte
Carlo results for the new \( x^* \) variable are assumed to have similar properties to the
known y control variable.


The following mathematical equations help illustrate the condition that is necessary to
reduce the sampling error using control variates. Consider taking the variance of both
sides of the equation that defines the new variable such that:


\[ \mathrm{var}(x^*) = \mathrm{var}[y + (\hat{x} - \hat{y})] \]


The control variable y does not have a sampling error because it has known properties.
Thus, var(y) equals zero. Now, the variance of the remaining two variables can be
rewritten as follows:


\[ \mathrm{var}(x^*) = \mathrm{var}(\hat{x}) + \mathrm{var}(\hat{y}) - 2\,\mathrm{cov}(\hat{x}, \hat{y}) \]


The control variate method will only reduce the sampling error in Monte Carlo
simulations if \( \mathrm{var}(x^*) \) is less than \( \mathrm{var}(\hat{x}) \). Another way of expressing this condition is as
follows:


\[ \mathrm{var}(\hat{y}) - 2\,\mathrm{cov}(\hat{x}, \hat{y}) < 0 \]


This relationship can be simplified as follows:


\[ \mathrm{cov}(\hat{x}, \hat{y}) > \frac{\mathrm{var}(\hat{y})}{2} \]


The covariance can be converted to correlation by dividing both sides of the previous
inequality by the product of the standard deviations as follows:


\[ \mathrm{corr}(\hat{x}, \hat{y}) > \frac{1}{2}\sqrt{\frac{\mathrm{var}(\hat{y})}{\mathrm{var}(\hat{x})}} \]

A practical financial example of applying control variates is the use of Monte Carlo
simulations in pricing Asian options. An Asian option is priced based on the average
value of the underlying asset over the lifespan of the option. A similar
derivative, such as a European option with known statistical properties, can be used as
a control variate. The price of the European option \( P_{BS} \) is determined by the
Black-Scholes-Merton option pricing model. Next, simulated prices are determined for the
Asian option and the European option and denoted \( P_A \) and \( P_{BS}^* \), respectively. The new
estimate of the Asian option price \( P_A^* \) can then be determined based on the following
equation:


\[ P_A^* = (P_A - P_{BS}^*) + P_{BS} \]
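
A full Asian option pricer is too long for a short sketch, so the Python example below applies the same adjustment (the known value of the control plus the difference between the two simulated estimates) to a simpler, hypothetical problem. The variable of interest is a lognormal ending capital, and the control is its linear approximation, whose expected value is known exactly. The distribution, parameters, and function name are all assumptions made for illustration.

```python
import math
import random
import statistics


def control_variate_estimate(n_reps, c0=100.0, mu=0.08, sigma=0.15, seed=11):
    """Return (plain estimate, adjusted estimate, plain SE, adjusted SE).

    x: ending capital c0 * exp(mu + sigma * z), treated as having an unknown mean.
    y: linear approximation c0 * exp(mu) * (1 + sigma * z), whose mean
       c0 * exp(mu) is known exactly because E[z] = 0.
    Both are hypothetical choices made for illustration.
    """
    rng = random.Random(seed)
    scale = c0 * math.exp(mu)
    known_y = scale                       # known expected value of the control
    x_draws, y_draws = [], []
    for _ in range(n_reps):
        z = rng.gauss(0.0, 1.0)           # the same random number drives x and y
        x_draws.append(scale * math.exp(sigma * z))
        y_draws.append(scale * (1.0 + sigma * z))
    x_hat = statistics.mean(x_draws)      # simulated estimate of x
    y_hat = statistics.mean(y_draws)      # simulated estimate of y
    x_star = known_y + (x_hat - y_hat)    # adjusted estimate
    # known_y is a constant, so the sampling error of x_star comes from x - y.
    diffs = [x - y for x, y in zip(x_draws, y_draws)]
    se_plain = statistics.stdev(x_draws) / math.sqrt(n_reps)
    se_adjusted = statistics.stdev(diffs) / math.sqrt(n_reps)
    return x_hat, x_star, se_plain, se_adjusted


x_hat, x_star, se_plain, se_adjusted = control_variate_estimate(5_000)
print(f"plain estimate:    {x_hat:.3f} (SE {se_plain:.4f})")
print(f"adjusted estimate: {x_star:.3f} (SE {se_adjusted:.4f})")
```

The control here is almost perfectly correlated with the variable of interest and has a similar variance, so the correlation condition is comfortably met. With a weakly correlated control, the adjusted standard error could exceed the plain one.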


MODULE QUIZ 24.1


1. Which of the following statements regarding Monte Carlo simulation is least 
accurate? When using Monte Carlo simulation: 
A. simulated data is used to numerically approximate the expected value of a 
function. 
B. the user specifies a complete data generating process (DGP) that is used to 
produce simulated data. 
C. the observed data are used directly to generate a simulated data set. 
D. a full statistical model is used that includes an assumption about the distribution 
of the shocks. 


2. Suppose an analyst is concerned about Monte Carlo sampling error. Based on an initial 
Monte Carlo simulation with 100 replications, the results indicated a standard 
deviation of 12.64. The simulation was rerun with 900 replications and the standard 
deviation remained at 12.64. What are the standard error estimates for the 
simulations with 100 replications and 900 replications, respectively? 


N= 100 N = 900 
A. 0.126 0.014 
B. 0.126 0.140 
C. 1.264 0.421 
D. 1.264 0.214 


3. A concern for Monte Carlo simulations is the size of the sampling error. One way to
reduce the sampling error is to use the antithetic variate technique. Which of the
following statements best describes this technique?
A. The simulation is rerun using a complement set of the original set of random
variables.
B. The number of replications is increased significantly to reduce sampling error.
C. Sample data is replaced after every replication to ensure it has an equal
probability of being redrawn.
D. The data generating process (DGP) is approximated by redefining the unknown
variable with a variable that has known properties.


MODULE 24.2: BOOTSTRAPPING AND RANDOM 
NUMBER GENERATION 


The Bootstrapping Method 


LO 24.d: Describe the bootstrapping method and its advantage over Monte Carlo 
simulation. 


Another way to generate simulated data is the bootstrapping method. The
bootstrapping approach draws random return data from a sample of historical data. 
Unlike the Monte Carlo simulation method, bootstrapping uses actual historical data 
instead of random data from a probability distribution. Furthermore, bootstrapping 
repeatedly draws data from the historical data set and replaces the data so it can be 
drawn again. 


Unlike Monte Carlo simulation, bootstrapping does not directly model the observed
data, nor does it make assumptions about the distribution of the data. Rather, the
observed data is resampled directly, because it is treated as a sample drawn from the
unknown distribution.


There are two commonly used classes of bootstraps: independent and identically 
distributed (i.i.d.) and circular block bootstrap (CBB). 


Independent and Identically Distributed (i.i.d.) 

The first bootstrapping approach that we will consider is the i.i.d. bootstrap. In this 
methodology, samples are simply drawn one-by-one from the observed data, and 
replaced. 


If we require a simulation of sample size of three from a data set with a total of 10 
observations, the i.i.d. bootstrap generates observation indices by randomly sampling 
three times with replacement from the values {1, 2, ..., 10}. These indices indicate the 
observed data to be included in the simulated (i.e., bootstrap) sample. 


For example, suppose that the three observations are drawn from a sample of 10 data
points \( \{x_1, x_2, \ldots, x_{10}\} \). The first simulation might include observations \( \{x_1, x_7, x_9\} \), the
second simulation \( \{x_1, x_5, x_{10}\} \), and the third \( \{x_4, x_4, x_8\} \). Note that the first two
simulated samples overlap (both contain \( x_1 \)), which is possible because the i.i.d.
bootstrapping method samples with replacement. Notice also that the third simulation
sample includes the same observation (\( x_4 \)) twice. This too is a result of sampling with
replacement.


The i.i.d. bootstrap methodology is valid when the observations are independent.
However, in finance it is often the case that data is dependent across time; for example,
volatility tends to be high during some periods and low during other periods.
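
The i.i.d. bootstrap can be sketched in a few lines of Python. The observed returns below are made up for illustration, and the function names are hypothetical. The sketch draws indices with replacement, builds one sample of size three as in the example, and then uses repeated resampling to gauge the sampling error of the mean.

```python
import random
import statistics


def iid_bootstrap_sample(data, sample_size, rng):
    """Draw observations one by one, with replacement, from the observed data."""
    indices = [rng.randrange(len(data)) for _ in range(sample_size)]
    return [data[i] for i in indices]


def bootstrap_means(data, n_reps, rng):
    """Recompute the sample mean on n_reps bootstrap samples of the original size."""
    return [statistics.mean(iid_bootstrap_sample(data, len(data), rng))
            for _ in range(n_reps)]


# Hypothetical observed returns (made up for illustration).
observed = [0.012, -0.004, 0.021, 0.007, -0.015, 0.003, 0.018, -0.009, 0.010, 0.001]
rng = random.Random(2024)

print(iid_bootstrap_sample(observed, 3, rng))   # three draws; repeats are possible
means = bootstrap_means(observed, 1_000, rng)
print(f"bootstrap standard error of the mean: {statistics.stdev(means):.4f}")
```

No distribution is specified anywhere in the code: every simulated value is one of the observed data points.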


Circular Block Bootstrap (CBB) 

When observations are not independent, a more sophisticated bootstrapping method 
than i.i.d. is required. One such method is the CBB method. CBB differs from i.i.d. in that 
rather than sampling single observations, the CBB method samples blocks of 
observations. The CBB method is used to produce bootstrap samples by sampling 
blocks, with replacement, until the required bootstrap sample size is produced. 


For example, suppose that 10 observations are available and they are sampled in blocks
of size three. Ten blocks are constructed, starting with \( \{x_1, x_2, x_3\} \), \( \{x_2, x_3, x_4\} \), ...,
\( \{x_8, x_9, x_{10}\} \), \( \{x_9, x_{10}, x_1\} \), \( \{x_{10}, x_1, x_2\} \). Notice that the first eight blocks use three consecutive
observations, but the final two blocks wrap around.


The block size used in the CBB methodology should be large enough to reflect the
dependence in the data. However, the block size should not be so large that the number
of blocks becomes small. For a sample size of n, a block size of \( \sqrt{n} \) is generally
appropriate.
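
The block construction and resampling can be sketched as follows. This is a minimal Python illustration with hypothetical function names. The observations are labeled x1 to x10 to mirror the example, and the block size is the square root of the sample size rounded to the nearest integer.

```python
import math
import random


def circular_blocks(data, block_size):
    """Build one block per starting position, wrapping around the end of the data."""
    n = len(data)
    return [[data[(start + offset) % n] for offset in range(block_size)]
            for start in range(n)]


def cbb_sample(data, sample_size, block_size, rng):
    """Sample whole blocks with replacement until sample_size observations are collected."""
    blocks = circular_blocks(data, block_size)
    sample = []
    while len(sample) < sample_size:
        sample.extend(rng.choice(blocks))
    return sample[:sample_size]   # trim the last block if it overshoots


observations = [f"x{i}" for i in range(1, 11)]
block_size = round(math.sqrt(len(observations)))   # 3 for a sample of 10

blocks = circular_blocks(observations, block_size)
print(blocks[0], blocks[-2], blocks[-1])            # the last two blocks wrap around
print(cbb_sample(observations, 10, block_size, random.Random(5)))
```

Because consecutive observations stay together inside each block, short-run dependence such as volatility clustering is carried into the bootstrap sample.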


LO 24.f: Describe situations where the bootstrapping method is ineffective. 


While bootstrapping is a useful statistical technique, it has its limitations. There are 
two specific issues that arise when using a bootstrap: 


1. Using the entire data set may not be reliable: As long as current market conditions 
are normal, using the complete historical data set is beneficial. On the other hand, 
when the current condition of financial markets is different from its usual state, the 
bootstrapping method may be ineffective. For example, using a bootstrap to estimate 
the value at risk during the financial crisis of 2007-2009 would produce an 
unrealistically low view of risk, because volatility was historically much lower. 


2. Structural changes: Another limitation of the bootstrapping method is that there
may have been recent permanent fundamental changes in the market. For example,
interest rates on U.S. T-bills were near zero for a decade beginning in 2008, a
condition that had never previously occurred over a long period. As a result,
bootstrapping using older historical data would be ineffective in replicating this
period.


Random Number Generation 


LO 24.e: Describe pseudo-random number generation. 


Random number generators are used to produce an irregular sequence of numerical 
values. Algorithms used to generate these random sequences are referred to as pseudo- 
random number generators (PRNGs). The term pseudo implies that these computer- 
generated numbers are not truly random: they are actually generated from a formula. 


PRNGs typically produce sequences of random numbers uniformly distributed between
zero and one. Each number should have an equal probability of being drawn from the
uniform (0,1) distribution.
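
To show what "generated from a formula" can look like, the Python sketch below implements a linear congruential generator, one classic textbook PRNG. It is offered only as an illustration: the reading does not name a specific algorithm, the multiplier, increment, and modulus are commonly cited constants chosen here as an assumption, and production libraries generally rely on more sophisticated generators.

```python
def lcg_uniform(seed, count, a=1664525, c=1013904223, m=2**32):
    """Generate pseudo-random numbers in [0, 1) with a linear congruential generator.

    Each state follows the formula state = (a * state + c) mod m, so the whole
    sequence is determined by the starting value. The constants are illustrative.
    """
    state = seed
    draws = []
    for _ in range(count):
        state = (a * state + c) % m
        draws.append(state / m)   # scale the integer state to the interval [0, 1)
    return draws


print(lcg_uniform(seed=12345, count=3))
```

Every value is fully determined by the previous state, which is why the numbers are pseudo-random rather than truly random.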


To produce pseudo-random numbers, an initial seed value must first be chosen. The 
choice of seed value will determine the random number sequence that is generated. In 
fact, any particular seed value will generate an identical set of values each time the 
PRNG is run. 


The reproducibility of PRNG outputs provides us with two benefits:


1. Repeatability: Because a particular seed value will always produce the same series of 
random values, we can replicate the sequence across several different experiments, 
which allows multiple alternative models to be estimated using the same simulated 
data. Furthermore, the use of a specific initial seed allows simulation results to be 
reproduced later—which may be required for regulatory compliance. 

2. Computing Clusters: Suppose that we are using a group of computers to model 
complex portfolios containing thousands of financial instruments that are all 
impacted by the same set of fundamental factors. Using a common seed value allows 
us to use the same set of random numbers across multiple simulations. Starting each 
PRNG in a cluster with the same seed allows each simulation to make use of the 
same values when studying the joint behavior of the instruments in the portfolio. 
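
Both benefits rest on the same property, which can be checked directly with Python's standard `random` module. The seed values and the helper name below are arbitrary, hypothetical choices.

```python
import random


def draw_sequence(seed, count):
    """Return `count` uniform(0, 1) draws from a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


first_run = draw_sequence(seed=2023, count=5)
second_run = draw_sequence(seed=2023, count=5)   # e.g., a rerun, or another machine
other_seed = draw_sequence(seed=2024, count=5)

print(first_run == second_run)   # same seed: identical sequence
print(first_run == other_seed)   # different seed: a different sequence
```

Reusing the seed lets alternative models be compared on identical simulated data, so any difference in results comes from the models rather than from the random draws.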


Disadvantages of Simulation Approaches 


LO 24.g: Describe the disadvantages of the simulation approach to financial 
problem solving. 


Disadvantages of the simulation approach to financial problem solving include: 


1. Specification of the DGP: Even with a large number of simulation iterations, when the 
assumptions of model inputs or the data generating process are unrealistic, 
imprecise results may occur. Alternate assumptions made in the DGP may lead to 
substantially different results. A common model misspecification relates to 
assumptions about the underlying probability distribution of inputs: for example, 
option prices are typically fat-tailed, but a model could erroneously draw option 
prices from a normal distribution. This would lead to inaccurate results, regardless 
of the number of replications. 


2. Computational cost: The best way to reduce the variation of simulation results is to 
use a large number of replications. If estimated parameters are complex, the 
computations may take an extremely long time to run. Some problems may require a 
large number of replications to obtain acceptable results; it is common to use at 
least 10,000 replications in Monte Carlo simulations. Computer processing speeds
have improved exponentially; however, the markets and problems that are examined
have also become increasingly complex, potentially leading to high
computation costs.


MODULE QUIZ 24.2


1. Which of the following statements regarding the bootstrapping method is least
accurate? Bootstrapping simulations:
A. draw data from historical data sets.
B. replace drawn data so it can be redrawn.
C. require assumptions with respect to the true distribution of the parameter
estimates.
D. rely on the key assumption that the present resembles the past.


2. Which of the following statements regarding the pseudo-random number generation
method is least accurate? Pseudo-random numbers are:
A. not truly random.
B. actually generated from a formula.
C. determined by the choice of the initial seed value.
D. impossible to predict.


3. The bootstrapping method is most likely to be effective when the: 
A. data contains outliers. 
B. present is different from the past. 
C. data is independent. 
D. markets have experienced structural changes. 


4. Monte Carlo simulation is a widely used technique in solving economic and financial
problems. Which of the following statements is least likely to represent a limitation
of the Monte Carlo technique when solving problems of this nature?
A. High computational costs arise with complex problems. 
B. Simulation results are experiment-specific because financial problems are 
analyzed based on a specific data generating process (DGP) and set of equations. 
C. Results of most Monte Carlo experiments are difficult to replicate. 
D. If the input variables have fat tails, Monte Carlo simulation is not relevant 
because it always draws random variables from a normally distributed population. 


KEY CONCEPTS 


LO 24.a 

A Monte Carlo simulation uses observations to estimate key model parameters, such as 
the mean and standard deviation. A complete data generating process (DGP) is created 
by combining these parameters with an assumption about the distribution of the 
standardized returns. 

The basic steps of a Monte Carlo simulation are: 

1. Generate data according to the assumed DGP. 

2. Calculate the function or statistic of interest. 

3. Repeat steps one and two to produce N replications. 

4. Estimate the quantity of interest.


5. Assess the accuracy by computing the standard error, and increase N until the 
required accuracy is achieved. 


LO 24.b 


The standard error estimate of a Monte Carlo simulation, \( s/\sqrt{N} \), can be reduced by a
factor of 10 by increasing N by a factor of 100, where quantity N is the number of 
replications or iterations. 


LO 24.c 


Antithetic variables and control variates can be used simultaneously to reduce the 
approximation error in a Monte Carlo simulation. 


With antithetic variables, random values are constructed to generate negative 
correlation within the values used in the simulation. Variance is reduced because the 
covariance between the simulated values is negative, so the variance of the sum is less 
than the sum of the variances. 


Control variates reduce the variance of the approximation by adding values with a 
mean of zero that are correlated to the simulation. 


LO 24.d 


Bootstrapping simulations repeatedly draw data from historical data sets, each time 
replacing the data so it can be redrawn. The bootstrapping technique requires no 
assumptions with respect to the true distribution of the parameter estimates. 


LO 24.e 


Pseudo-random numbers are not truly random, as they are actually generated from a 
formula. The choice of the initial seed value determines the random numbers that are 
generated. 


The reproducibility of outputs from pseudo-random number generators (PRNGs) 
allows results to be replicated across multiple experiments, or to be generated on 
multiple computers. 


LO 24.f 
Two primary limitations arise when using the bootstrapping method: 


1. This method may not be reliable if current financial market conditions differ from 
their normal state. 


2. Structural changes may have occurred in markets so that current conditions are 
different from anything that has occurred in the past. 


LO 24.g 
Two disadvantages of the simulation approach to financial problem solving include: 


1. Specification of the data generating process (DGP): When the assumptions of model 
inputs or the data generating process are unrealistic, inaccurate results may occur. 


2. Computational cost: While computer processing times have decreased, markets have 
also become increasingly complex, potentially leading to high computation costs. 


ANSWER KEY FOR MODULE QUIZZES 


Module Quiz 24.1 


1.C In both Monte Carlo simulation and bootstrapping, the goal is to numerically
approximate the expected value of a complex function through the use of
computer-generated values (i.e., simulated data). The main difference between
Monte Carlo simulation and bootstrapping is the source of the simulated data: in
Monte Carlo simulation, the user specifies a complete DGP that is used to produce
the simulated data, while in bootstrapping, the observed data are used directly to
generate the simulated data set, without specifying a complete DGP. (LO 24.a)


2.C The standard error is determined by dividing the standard deviation by the square
root of the number of replications, \( s/\sqrt{N} \). The standard error estimate for the first
simulation of 100 replications is 1.264 (i.e., 12.64 / 10). With 900 replications, the 
standard error estimate is reduced to 0.4213 (i.e., 12.64 / 30). (LO 24.b) 


3.A The antithetic variate technique reduces Monte Carlo sampling error by 
rerunning the simulation using a complement set of the original set of random 
variables. (LO 24.c) 


Module Quiz 24.2 


1.C The bootstrapping technique does not require any assumptions with respect to 
the true distribution of the parameter estimates. Bootstrapping simulations 
repeatedly draw data from historical data sets, and then replace the data so it can 
be redrawn. The bootstrapping method is only as valid as the assumption that the 
present resembles the past. (LO 24.d) 


2.D Pseudo-random numbers appear random because they are difficult to predict. 
However, they are produced by deterministic functions that are complex rather 
than truly random. The initial choice of a seed value determines the series of 
random numbers that is generated. (LO 24.e) 


3.C The bootstrapping method is most likely to be effective when the data is 
independent and there are no outliers in the data. Bootstrapping uses the entire 
data set to generate a simulated sample, so the bootstrapping method should be 
reliable if the current state of the financial market is the same as its normal state, 
meaning that no structural changes have taken place. (LO 24.f) 


4.D A disadvantage of Monte Carlo simulations is that imprecise results may occur 
when the assumptions of model inputs or DGP are unrealistic. The distribution of 
input variables does not need to be the normal distribution. Problems will arise if 
a real-world variable is fat-tailed, but the model erroneously draws option prices 
from a normal distribution. (LO 24.g) 


